Lab: Differential Expression
R. Gentleman, W. Huber, D. Scholtens, and A. von Heydebreck
June 23, 2005

1 Motivation

In this lab we will cover some of the basic principles of finding differentially expressed genes. It is based on two publications by ? and ?. There are many different ways to detect differentially expressed genes. Rather than prescribe a standard way of doing this, in this lab we will explore a variety of these. The goal is to give you an overview of the existing ideas, so that you can make an appropriate choice in your own analyses.

1.1 The gene-by-gene approach

In current practice, differential expression analysis is generally done using a gene-by-gene approach, ignoring the dependencies between genes, e.g. their interrelationships in “pathways”. Clearly, this is not satisfactory, and it will change as we learn more. For the purpose of this lab, we cover the gene-by-gene approach.

1.2 Non-specific prefiltering

Most microarrays contain probes for many more genes than will be differentially expressed. Indeed, one of the basic assumptions of normalization is that most genes are not differentially expressed. To alleviate the loss of power from the formidable multiplicity of gene-by-gene hypothesis testing, we advise that some form of non-specific prefiltering be carried out. By non-specific we mean that it is done without reference to phenotype. Its (sole) aim is to remove from consideration that set of probes whose genes are not differentially expressed under any comparison. We have found it most useful to select genes on the basis of variability (?). Only genes that show some variation across samples can potentially be differentially expressed between our groups of interest.

1.3 Fold-change versus t-test

The simplest approach is to select genes using a fold-change criterion. This may be the only possibility in cases where few replicates are available.
An analysis based solely on fold change, however, precludes the assessment of significance of observed differences in the presence of biological and experimental variation, which may differ from gene to gene. This is the main reason for using statistical tests to assess differential expression. In general, one might look at all sorts of differences between the distributions of a gene’s expression levels under different conditions. Most often, the location parameter (e.g. mean or median) is considered. This leads to the t-test and its variations. As we have discussed in the lectures, there are also good reasons to consider other criteria, such as the partial area under the ROC curve (pAUC).

One may distinguish between parametric tests, such as the t-test, and non-parametric tests, such as the Mann–Whitney test or permutation tests. Parametric tests usually have higher power if the underlying model assumptions, such as Normality in the case of the t-test, are at least approximately fulfilled. Non-parametric tests have the advantage of making less stringent assumptions on the data-generating distribution. In many microarray studies, however, a small sample size leads to insufficient power for non-parametric tests. A pragmatic approach in these situations is to employ parametric tests, but to use the resulting p-values cautiously to rank genes by their evidence for differential expression.

2 Non-specific filtering

First load the Biobase package and the data set ALL (acute lymphoblastic leukemia) from the package of the same name. Since the data in ALL are large and phenotypically quite diverse, we reduce the cases to a reasonable two-group comparison. First, we select only the B-cell tumors, then among these, the ones that have either a negative or a positive status with respect to the BCR/ABL mutation.
> library("Biobase")
> library("ALL")
> data("ALL")
> names(pData(ALL))
 [1] "cod"            "diagnosis"      "sex"            "age"
 [5] "BT"             "remission"      "CR"             "date.cr"
 [9] "t(4;11)"        "t(9;22)"        "cyto.normal"    "citog"
[13] "mol.biol"       "fusion protein" "mdr"            "kinet"
[17] "ccr"            "relapse"        "transplant"     "f.u"
[21] "date last seen"
> ALL$BT
  [1] B2 B2 B4 B1 B2 B1 B1 B1 B2 B2 B3 B3 B3 B2 B3 B  B2 B3 B2 B3 B2 B2 B2 B1 B1
 [26] B2 B1 B2 B1 B2 B  B  B2 B2 B2 B1 B2 B2 B2 B2 B2 B4 B4 B2 B2 B2 B4 B2 B1 B2
 [51] B2 B3 B4 B3 B3 B3 B4 B3 B3 B1 B1 B1 B1 B3 B3 B3 B3 B3 B3 B3 B3 B1 B3 B1 B4
 [76] B2 B2 B1 B3 B4 B4 B2 B2 B3 B4 B4 B4 B1 B2 B2 B2 B1 B2 B  B  T  T3 T2 T2 T3
[101] T2 T  T4 T2 T3 T3 T  T2 T3 T2 T2 T2 T1 T4 T  T2 T3 T2 T2 T2 T2 T3 T3 T3 T2
[126] T3 T2 T
Levels: B B1 B2 B3 B4 T T1 T2 T3 T4
> ALLB = ALL[, grep("^B", as.character(ALL$BT))]
> ALLBCRNEG = ALLB[, ALLB$mol == "BCR/ABL" | ALLB$mol == "NEG"]
> ALLBCRNEG$mol.biol = factor(ALLBCRNEG$mol.biol)

Now, if we are going to filter on the basis of variability, we might first want to make sure that the variability is not dominated by its dependence on the level. If it were, then selecting on the basis of variability would be confounded with selection on the basis of level. There are good reasons not to use level for gene selection. To check for an association, we plot row-wise means versus row-wise standard deviations, together with a smoothed estimate of their regression.

> library("vsn")
> meanSdPlot(ALLBCRNEG)

The result is shown in Figure 1.

Figure 1: Row-wise means versus row-wise standard deviations of the ALL data. (Horizontal axis: rank(mean); vertical axis: sd.)

Exercise 1
a) Comment on the plot. Do you think that the relationship between mean and standard deviation is sufficiently weak?
b) Have a look at the manual page of meanSdPlot. What is the use of the ranks parameter?

Presuming that we decide that the relationship is not very strong, we proceed.
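The check described above can also be tried without the ALL data. The following is a minimal base-R sketch on simulated values; the matrix x, its dimensions, and the running-median window k = 101 are illustrative assumptions, and meanSdPlot may well compute its trend line differently.

```r
## Hypothetical stand-in data: 1000 "genes" x 20 "samples" on a log-like scale.
## Each gene gets its own mean level but the same standard deviation (0.5),
## so by construction there is no mean-sd dependence in this example.
set.seed(1)
n_genes   <- 1000
n_samples <- 20
gene_mean <- runif(n_genes, min = 4, max = 12)
x <- matrix(rnorm(n_genes * n_samples, mean = rep(gene_mean, n_samples), sd = 0.5),
            nrow = n_genes)

## Row-wise summaries.
row_mean <- rowMeans(x)
row_sd   <- apply(x, 1, sd)

## Running median of the standard deviations along the ranked means,
## a simple smoothed estimate of the trend.
ord   <- order(row_mean)
trend <- runmed(row_sd[ord], k = 101)

## Rank correlation as a one-number summary of the association.
rho <- cor(row_mean, row_sd, method = "spearman")
print(range(trend))
print(rho)
```

A roughly flat trend and a rank correlation near zero suggest that filtering on variability is not simply filtering on level. To look at it graphically, plot rank(row_mean) against row_sd and add lines(seq_along(ord), trend).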
Our next step is to select some proportion of the genes with a relatively large spread. Let’s say the top 20%.

> sds = esApply(ALL, 1, sd)
> sel = (sds > quantile(sds, 0.8))
> ALLset1 = ALLBCRNEG[sel, ]

A potential drawback of this approach is found in situations where we are interested in a phenotype that has relatively few members. In this case, a gene which is differentially expressed between that group and the other(s) may not have a large overall standard deviation. How would you address this situation?

At this point you may want to look at some heatmaps of the data to see if there are any obvious patterns. Consult the manual page of the function by typing:

> ?heatmap

3 Differential Expression

In Bioconductor, the genefilter package allows you to easily select genes using a variety of filters. Additionally, for some tests and comparisons we have developed fast versions. These include rowttests, which performs a t-test for every row in a gene expression matrix, rowFtests, which does F-tests, and rowQ, which calculates a quantile for each row.

We have selected a subset of the data with two prominent phenotypes, BCR/ABL and NEG, and will now consider some different methods for finding differentially expressed genes. First, and perhaps easiest, is to use a t-test.

> library("genefilter")
> tt = rowttests(ALLset1, "mol")
> names(tt)
[1] "statistic" "dm"        "df"        "p.value"

Consult the manual page for rowttests for the meaning of the four different elements of the return value tt.

Many practitioners have learned that small p-values do not always correspond to genes for which there have been large changes. Let us look at the so-called volcano plot.

> plot(tt$statistic, abs(tt$dm), pch = ".", xlab = "t-statistic",
+     ylab = "absolute value of mean fold change")

The result is shown in Figure 2.

Exercise 2
Determine whether there is a small set of probes that correspond to differentially expressed genes, using the t-test results.
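For readers who want to see what a row-wise t-test involves, the following base-R sketch computes an equal-variance two-sample t-statistic for every row of a simulated matrix. The function row_t, the simulated data, and the sign convention of dm (first factor level minus second) are assumptions made for illustration; consult the rowttests manual page for the conventions that genefilter actually uses.

```r
## Hypothetical data: 200 "genes" x 20 "samples", two groups of 10.
set.seed(2)
n_genes <- 200
grp <- factor(rep(c("BCR/ABL", "NEG"), each = 10))
x <- matrix(rnorm(n_genes * length(grp)), nrow = n_genes)
## Shift the first 10 rows in the first group so that some rows truly differ.
x[1:10, grp == "BCR/ABL"] <- x[1:10, grp == "BCR/ABL"] + 2

## Row-wise two-sample t-test with pooled (equal) variance.
row_t <- function(x, grp) {
  g1 <- grp == levels(grp)[1]
  n1 <- sum(g1)
  n2 <- sum(!g1)
  m1 <- rowMeans(x[, g1, drop = FALSE])
  m2 <- rowMeans(x[, !g1, drop = FALSE])
  ## Within-group sums of squares; the row means recycle down the columns.
  ss1 <- rowSums((x[, g1, drop = FALSE] - m1)^2)
  ss2 <- rowSums((x[, !g1, drop = FALSE] - m2)^2)
  df <- n1 + n2 - 2
  se <- sqrt((ss1 + ss2) / df * (1 / n1 + 1 / n2))
  dm <- m1 - m2
  stat <- dm / se
  data.frame(statistic = stat, dm = dm, df = df,
             p.value = 2 * pt(-abs(stat), df))
}

tt_sketch <- row_t(x, grp)

## Cross-check the first row against t.test() with equal variances.
ref <- t.test(x[1, grp == "BCR/ABL"], x[1, grp == "NEG"], var.equal = TRUE)
print(all.equal(unname(ref$statistic), tt_sketch$statistic[1]))
print(all.equal(ref$p.value, tt_sketch$p.value[1]))
```

Counting sum(tt_sketch$p.value < 0.01) gives a first impression of how many rows stand out, which is one way to start on Exercise 2 with the real tt object.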
4 Multiple Testing

One of the subject areas that has received a great deal of attention is that of multiple testing. We provide a brief introduction to the functionality in the multtest package. Many of the algorithms in the multtest package depend on random permutations of the samples. The number of permutations is controlled by the parameter B. For the purpose of these exercises, since the permutations may take a long time, we choose a rather small value of B. In real applications, you should use one that is 10 or 100 times larger.

Figure 2: Volcano plot. (Horizontal axis: t-statistic; vertical axis: absolute value of mean fold change.)

> library("multtest")
> cl = as.numeric(ALLset1$mol) - 1
> resT = mt.maxT(exprs(ALLset1), classlabel = cl, B = 1000)
> ord = order(resT$index)
> rawp = resT$rawp[ord]

The next figure shows the histogram of unadjusted permutation p-values, as given by the vector rawp. The high proportion of small p-values suggests that a substantial fraction of the genes are differentially expressed between the two groups.

> hist(rawp, breaks = 50, col = "#B2DF8A")

The result is shown in Figure 3.

Figure 3: Histogram of rawp, the unadjusted p-values. (Horizontal axis: rawp; vertical axis: Frequency.)

In order to control the family-wise error rate (FWER), that is, the probability of at least one false positive in the set of significant genes, we have used the permutation-based maxT procedure of Westfall and Young (?), as implemented in the function mt.maxT. We obtain 34 genes with an adjusted p-value below 0.05:

> sum(resT$adjp < 0.05)
[1] 34

A comparison of this number to the height of the leftmost bar in the histogram suggests that we may be missing a large number of differentially expressed genes. The FWER is a very stringent criterion, and in some microarray studies only a few genes may be significant in this sense, even if many more are truly differentially expressed. A more liberal criterion is provided by the false
discovery rate (FDR), that is, the expected proportion of false positives among the genes that are called significant. We can use the procedure of ? as implemented in multtest to control the FDR (note, however, that this procedure makes certain assumptions on the dependence structure between genes):

> res = mt.rawp2adjp(rawp, proc = "BH")
> sum(res$adjp[, "BH"] < 0.05)
[1] 220

5 limma

A t-test analysis can also be conducted with functions of the limma package. First, we have to define the design matrix. One possibility is to use an intercept term that represents the mean log intensity of a gene across all samples (first column consisting of 1s), and to encode the difference between the two classes in the second column.

> library("limma")
> labels = as.numeric(ALLset1$mol == "BCR/ABL")
> design = cbind(mean = 1, diff = labels)
> design

Next, a linear model is fitted for every gene by the function lmFit, and Empirical Bayes moderation of the standard errors is done by the function eBayes.

> fit = lmFit(ALLset1, design)
> fit = eBayes(fit)
> topTable(fit, coef = "diff", adjust.method = "fdr")
            ID        M        A        t      P.Value         B
156  1636_g_at 1.100012 9.196420 9.033955 1.232012e-10 21.293157
1915  39730_at 1.152527 9.000049 8.587774 4.894886e-10 19.341283
155    1635_at 1.202675 7.897095 7.338622 1.034346e-07 13.905543
163    1674_at 1.427212 5.001771 7.050134 2.874800e-07 12.667763
2066  40504_at 1.181029 4.244478 6.664739 1.298466e-06 11.032194
2014  40202_at 1.779378 8.621443 6.391779 3.628723e-06  9.889235
1262  37015_at 1.032702 4.330511 6.242224 5.996667e-06  9.269402
437   32434_at 1.678550 4.466311 5.971866 1.696739e-05  8.161945
1269  37027_at 1.348702 8.444161 5.805421 3.077880e-05  7.489382
1366  37403_at 1.117721 5.086540 5.483354 1.077006e-04  6.210589

When you compare the resulting p-values with those from the parametric t-test, you will see that they are almost identical:

> plot(-log10(tt$p.value), -log10(fit$p.value[, "diff"]),
+     xlab = "-log10(p) from two-sample t-test",
+     ylab = "-log10(p) from moderated t-test (limma)", pch = ".")
> abline(c(0, 1), col = "red")

The result is shown in Figure 4.

Figure 4: Comparison of p-values from the unmoderated and moderated t-test in a situation with a large number of samples in both groups.

Because of the large number of samples, the Empirical Bayes moderation is not so relevant here: in this data set the gene-specific variance can be estimated well from the data of each gene.

5.1 Small sample sizes

However, the Empirical Bayes moderation may be quite useful in cases with fewer replicates. Let us draw a subsample with 3 arrays from each group from our data:

> subs = c(35, 65, 75, 1, 69, 71)
> ALLset2 = ALLBCRNEG[, subs]
> table(ALLset2$mol)

BCR/ABL     NEG
      3       3

We repeat the testing procedure in the same way as before,

> tt2 = rowttests(ALLset2, "mol")
> fit2 = eBayes(lmFit(ALLset2, design = design[subs, ]))

and plot the results in Figure 5.

> plot(-log10(tt2$p.value), -log10(fit2$p.value[, "diff"]),
+     xlab = "-log10(p) from two-sample t-test",
+     ylab = "-log10(p) from moderated t-test (limma)", pch = ".")
> abline(c(0, 1), col = "red")

Figure 5: Comparison of p-values from the unmoderated and moderated t-test in a situation with a small number of samples in both groups.

Let us have a look at a gene which has a small p-value in the ordinary t-test but a large one in the moderated test:

> g = which(tt2$p.value < 1e-04 & fit2$p.value[, "diff"] > 0.02)
> plot(exprs(ALLset2)[g, ], pch = 16, col = as.numeric(ALLset2$mol))

The plot is shown in Figure 6.

Figure 6: Expression values of 36404_at. This probe set has a highly significant p-value in the unmoderated t-test, but is unremarkable in the moderated test.
Note the y-axis scaling: while the two groups look well separated, the absolute difference is small. Most likely, this is a chance artifact. You can verify this by looking at the expression values of this probe set in the other samples that were not used here.

6 ROC

The receiver operating characteristic (ROC) curve is a popular method for finding variables that discriminate between two groups ?. The package ROC can be used to compute these curves. In this section we look at some exercises using this package.

We want to find marker genes that are specifically expressed in leukemias with the BCR/ABL translocation. At a specificity of at least 0.9, we would like to identify the genes with the best sensitivity for the BCR/ABL phenotype. This can be expressed by the partial area under the ROC curve (pAUC; we choose t0 = 0.1). To limit the computation time, we compute the pAUC statistic only for the first 100 probe sets.

> library("ROC")
> dxrule.sca
function (x, thresh)
ifelse(x > thresh, 1, 0)

dxrule.sca is an example of a simple one-marker classification rule: if the value of the marker is greater than a threshold, the case is assigned to group 1, otherwise to group 0. We now define a function myRule that encapsulates this rule in the way that it is used by the helper functions of the ROC package.

> myRule = function(x) {
+     pAUC(rocdemo.sca(truth = labels, data = x, rule = dxrule.sca),
+         t0 = 0.1)
+ }
> pAUC1s = esApply(ALLset1[1:100, ], 1, myRule)

Next we select the probe set with the maximal value of our pAUC statistic,

> j = which.max(pAUC1s)
> RC = rocdemo.sca(truth = labels, data = exprs(ALLset1)[j, ],
+     rule = dxrule.sca, caseLabel = "new case", markerLabel = geneNames(ALLset1)[j])

and plot the corresponding ROC curve.

> plot(RC, main = geneNames(ALLset1)[j])

The result is shown in Figure 7.

7 A Bayesian Approach

A different approach to dealing with the multiplicity problem is to adopt a Bayesian perspective. Kendziorski et al.
have produced the EBarrays package, which provides some Bayesian modeling capabilities. The general idea is to estimate the posterior probability that genes are differentially expressed. We refer the reader to their papers and the vignette in the package for more details. Here, we concentrate on how to use the software.

> library("EBarrays")

The package requires a slightly peculiar way of specifying the groups of interest:

> pattern = ebPatterns(c("1,1,1,1,1,1", "1,1,1,2,2,2"))
> em = emfit(ALLset2, family = "GG", hypotheses = pattern, num.iter = 10)
> empp = postprob(em, ALLset2)

Exercise 3
Examine a histogram of the values computed by EBarrays; you may find it helpful to consult the manual page first. Consider comparing them with the p-values computed using limma. The values computed by EBarrays correspond to the posterior probability of differential expression; how would you decide which genes are differentially expressed? Using your criteria, how would you compare that to the limma results?

> hist(empp[, "P2"], 100)
> plot(-log10(fit2$p.value[, "diff"]), empp[, "P2"], pch = ".")

Figure 7: ROC curve of a gene (1326_at) that separates well between BCR/ABL positive and negative tumors. (Horizontal axis: 1-spec: 1326_at; vertical axis: sens: new case.)

8 Comparisons

In Sections 3 and 6 you have learned how to find differentially expressed genes using two different criteria: location changes in the distribution of a gene’s expression, and its receiver operating characteristic. Let us now compare these methods and focus on the influence of sample size. The code used in this section is rather advanced and requires updating the package genefilter to a version >= 1.7.7. You don’t necessarily have to reproduce it here, but try to understand what it does and take a look at the output in Figure 9.
> library(reposTools)
> update.packages2("genefilter", develOK = TRUE)

In your updated version of genefilter you will now find the new function rowpAUCs. It does the same thing as the functions rocdemo.sca and pAUC in package ROC, that is, it computes ROC curves and pAUCs, but it was designed to be much faster and more memory efficient. It can deal with exprSets and works through them row by row. You may want to look into the documentation of rowpAUCs for further information.

We want to see how the sample size affects the number of genes that are found to be differentially expressed by the two methods. For this, we need to divide our data set into random sample subsets and do our computations on these subsets.

Figure 8: Left: histogram of posterior probabilities for differential expression between the BCR/ABL positive and negative tumors. Right: Comparison of the posterior probabilities with the (negative logarithm of the) p-values from the limma analysis.

First we define a function that produces a numeric 0/1 vector from an exprSet. The length of the vector is the number of samples, and it is intended to describe a grouping of the samples. (We did a similar thing in Section 5 to get the labels vector. Since we need this more than once in the following steps, we now make it a function.)

> classlabel <- function(x) {
+     stopifnot(class(x) == "exprSet", "mol.biol" %in% colnames(pData(x)))
+     return(as.numeric(pData(x)$mol.biol == "BCR/ABL"))
+ }

Now let us build wrapper functions around rowttests

> nrsel.ttest = function(x, pthresh = 0.05) {
+     pval = rowttests(x, "mol")$p.value
+     return(sum(pval < pthresh))
+ }

and rowpAUCs.
> nrsel.pAUC = function(x, pAUCthresh = 0.03) {
+     pAUC = rowpAUCs(x, fac = classlabel(x), p = 0.1)$pAUC
+     return(sum(pAUC > pAUCthresh))
+ }

Note that the choices of thresholds are, as always, somewhat arbitrary, and that the one for the t-test, pthresh, is not directly comparable to the one for rowpAUCs, pAUCthresh.

What we need now is a function that does some resampling for various data set sizes. To further reduce computation time, we only take a subset of 500 randomly sampled genes from our data set (you can try using more).

> resample <- function(selfun, groupsize = seq(6, 36, by = 6),
+     nrep = 25) {
+     # work on a random subset of 500 genes to save time
+     ALLset3 <- ALLset1[sample(1:nrow(exprs(ALLset1)), 500), ]
+     n <- matrix(nrow = nrep, ncol = length(groupsize))
+     for (i in seq(along = groupsize)) {
+         for (rep in 1:nrep) {
+             # draw groupsize[i] samples from each of the two classes
+             samplesubset <- c(sample(which(classlabel(ALLset3) ==
+                 0), groupsize[i]), sample(which(classlabel(ALLset3) ==
+                 1), groupsize[i]))
+             n[rep, i] <- do.call(selfun, args = list(x = ALLset3[,
+                 samplesubset]))
+             cat(".")
+         }
+         cat("\n")
+     }
+     mns <- apply(n, 2, mean)
+     stds <- apply(n, 2, sd)
+     # the error bars run from mns - stds to mns + stds, so ylim covers both
+     plot(groupsize, mns, pch = 16, col = "#c03030",
+         ylim = c(min(mns - stds) - 5, max(mns + stds) + 5), xlab = "groupsize",
+         ylab = "selected no. of diff. exp. genes", main = selfun)
+     segments(groupsize, mns - stds, groupsize, mns + stds, col = "red")
+ }

Run the function, once with nrsel.ttest and once with nrsel.pAUC.

> par(mfrow = c(1, 2))
> resample("nrsel.ttest")
> resample("nrsel.pAUC")

Exercise 4
Which of the two criteria seems more reasonable?

In this vignette you have seen a number of different ways to determine interesting genes, that is, genes that are differentially expressed. Compare some of the lists. What do you think about the amount of overlap? Do you have some idea how you might decide whether one list is better than some of the others? Another way to consider the different methods is to ask what patterns of expression are selected by a method and which ones are not.
For example, is one method more affected by outliers than another? Or how is it affected by the shape of the distribution?

Figure 9: Comparing number of differentially expressed genes for different sample sizes. Left: t-test criterion (nrsel.ttest). Right: pAUC criterion (nrsel.pAUC). (Horizontal axes: groupsize; vertical axes: selected no. of diff. exp. genes.)

The version number of R and packages loaded for generating this document are:

R version 2.1.0, 2005-04-28, i686-pc-linux-gnu

attached base packages:
"splines" "tools" "stats" "graphics" "grDevices" "utils" "datasets" "methods" "base"

other attached packages:
  EBarrays    "1.0.19"
  lattice     "0.11-6"
  ROC         "1.1.0"
  limma       "1.8.16"
  multtest    "1.6.0"
  genefilter  "1.6.0"
  survival    "2.17"
  vsn         "1.6.0"
  ALL         "1.0.2"
  Biobase     "1.5.8"