Lab: Differential Expression

R. Gentleman, W. Huber, D. Scholtens, and A. von Heydebreck
June 23, 2005
1 Motivation
In this lab we will cover some of the basic principles of finding differentially expressed genes. It is
based on two publications by ? and ?.
There are many different ways to detect differentially expressed genes. Rather than prescribe a
standard way of doing this, in this lab we will explore a variety of these. The goal is to give you an
overview of the existing ideas, so you can make an appropriate choice in your own analyses.
1.1 The gene-by-gene approach
In current practice, differential expression analysis is generally done using a gene-by-gene approach,
ignoring the dependencies between genes, e.g. their interrelationships in “pathways”. Clearly, this
is not satisfactory, and it will change as we learn more. For the purpose of this lab, we cover the
gene-by-gene approach.
1.2 Non-specific prefiltering
Most microarrays contain probes for many more genes than will be differentially expressed. Indeed,
one of the basic assumptions of normalization is that most genes are not differentially expressed. To
alleviate the loss of power from the formidable multiplicity of gene-by-gene hypothesis testing, we
advise that some form of non-specific prefiltering should be carried out. By non-specific we mean
that it is done without reference to phenotype. Its (sole) aim is to remove from consideration that
set of probes whose genes are not differentially expressed under any comparison. We have found it
most useful to select genes on the basis of variability (?). Only the genes that show any variation
across samples can potentially be differentially expressed between our groups of interest.
1.3 Fold-change versus t-test
The simplest approach is to select genes using a fold-change criterion. This may be the only
possibility in cases where few replicates are available. An analysis based solely on fold change,
however, precludes the assessment of significance of observed differences in the presence of biological
and experimental variation, which may differ from gene to gene. This is the main reason for using
statistical tests to assess differential expression.
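To make this point concrete, here is a minimal sketch in base R. The numbers are made up for illustration and are not taken from the ALL data: two hypothetical genes have the same difference in group means (a log fold change of 1), but very different within-group variation.

```r
# Hypothetical log2 expression values for two genes, four arrays per group.
geneA_g1 <- c(5.0, 5.1, 4.9, 5.0)
geneA_g2 <- c(6.0, 6.1, 5.9, 6.0)
geneB_g1 <- c(3, 7, 4, 6)
geneB_g2 <- c(4, 8, 5, 7)

# Both genes show the same difference in group means ...
fcA <- mean(geneA_g2) - mean(geneA_g1)
fcB <- mean(geneB_g2) - mean(geneB_g1)

# ... but the t-test also takes the within-group variation into account.
pA <- t.test(geneA_g2, geneA_g1, var.equal = TRUE)$p.value
pB <- t.test(geneB_g2, geneB_g1, var.equal = TRUE)$p.value

print(c(fcA = fcA, fcB = fcB, pA = pA, pB = pB))
```

A fold-change cutoff treats the two genes identically, whereas the t-test gives strong evidence only for the first one.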
In general, one might look at all sorts of differences between the distributions of a gene’s expression
levels under different conditions. Most often, the location (e.g. mean or median) parameter is
considered. This leads to the t-test and its variations. As we have discussed in the lectures, there
are also good reasons to consider other criteria, such as the partial area under the ROC curve
(pAUC).
One may distinguish between parametric tests, such as the t–test, and non-parametric tests, such
as the Mann–Whitney test or permutation tests. Parametric tests usually have a higher power if
the underlying model assumptions, such as Normality in the case of the t–test, are at least approximately fulfilled. Non–parametric tests have the advantage of making less stringent assumptions on
the data–generating distribution. In many microarray studies however, a small sample size leads to
insufficient power for non–parametric tests. A pragmatic approach in these situations is to employ
parametric tests, but to use the resulting p–values cautiously to rank genes by their evidence for
differential expression.
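The loss of power of rank-based tests in very small samples can be seen directly. The following sketch uses made-up values for three replicates per group: even with perfect separation, the exact two-sided Mann–Whitney p-value cannot fall below 0.1, because only choose(6, 3) = 20 rank arrangements exist.

```r
# Three hypothetical replicates per group, perfectly separated.
x <- c(1.0, 1.2, 1.1)
y <- c(3.0, 3.3, 3.1)

# Exact two-sided Mann-Whitney (Wilcoxon rank sum) test. With 3 + 3
# observations the smallest attainable p-value is 2 / choose(6, 3).
p_wilcox <- wilcox.test(x, y)$p.value

# The t-test (Welch version, the default of t.test) uses the magnitudes of
# the values and can therefore reach much smaller p-values.
p_t <- t.test(x, y)$p.value

print(c(wilcoxon = p_wilcox, t.test = p_t, floor = 2 / choose(6, 3)))
```

This is one reason why a parametric test, used with caution as a ranking device, can be the pragmatic choice for small studies.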
2 Non-specific filtering
First load the Biobase package and the data set ALL (acute lymphoblastic leukemia) from the
package of the same name. Since the data in ALL are large and phenotypically quite diverse, we
reduce the cases down to a reasonable two group comparison. First, we select only the B-cell
tumors, then among these, the ones that have either a negative or a positive status with respect to
the BCR/ABL mutation.
> library("Biobase")
> library("ALL")
> data("ALL")
> names(pData(ALL))
 [1] "cod"            "diagnosis"      "sex"            "age"
 [5] "BT"             "remission"      "CR"             "date.cr"
 [9] "t(4;11)"        "t(9;22)"        "cyto.normal"    "citog"
[13] "mol.biol"       "fusion protein" "mdr"            "kinet"
[17] "ccr"            "relapse"        "transplant"     "f.u"
[21] "date last seen"
> ALL$BT
  [1] B2 B2 B4 B1 B2 B1 B1 B1 B2 B2 B3 B3 B3 B2 B3 B  B2 B3 B2 B3 B2 B2 B2 B1 B1
 [26] B2 B1 B2 B1 B2 B  B  B2 B2 B2 B1 B2 B2 B2 B2 B2 B4 B4 B2 B2 B2 B4 B2 B1 B2
 [51] B2 B3 B4 B3 B3 B3 B4 B3 B3 B1 B1 B1 B1 B3 B3 B3 B3 B3 B3 B3 B3 B1 B3 B1 B4
 [76] B2 B2 B1 B3 B4 B4 B2 B2 B3 B4 B4 B4 B1 B2 B2 B2 B1 B2 B  B  T  T3 T2 T2 T3
[101] T2 T  T4 T2 T3 T3 T  T2 T3 T2 T2 T2 T1 T4 T  T2 T3 T2 T2 T2 T2 T3 T3 T3 T2
[126] T3 T2 T
Levels: B B1 B2 B3 B4 T T1 T2 T3 T4
> ALLB = ALL[, grep("^B", as.character(ALL$BT))]
> ALLBCRNEG = ALLB[, ALLB$mol == "BCR/ABL" | ALLB$mol == "NEG"]
> ALLBCRNEG$mol.biol = factor(ALLBCRNEG$mol.biol)
Now, if we are going to filter on the basis of variability, we might first want to make sure that the
variability is not dominated by its dependence on the level. If it were, then selecting on the basis
of variability would be confounded with selection on the basis of level. There are good reasons not
to use level for gene selection.
To check for an association we plot the row-wise standard deviations against the row-wise means,
together with a smoothed estimate of their regression.
> library("vsn")
> meanSdPlot(ALLBCRNEG)
The result is shown in Figure 1.
[Plot: sd on the y-axis against rank(mean) on the x-axis.]
Figure 1: Row-wise means versus row-wise standard deviations of the ALL data.
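To see what a strong mean–variance dependence looks like, and why it matters for variance-based filtering, the following base R sketch may help. It uses simulated data with multiplicative noise (an assumption made purely for illustration; it is not the ALL data) and summarises the association by a Spearman correlation rather than by the smoothed curve of meanSdPlot.

```r
set.seed(123)
n_genes  <- 1000
n_arrays <- 10

# Hypothetical gene-wise intensity levels and multiplicative noise.
mu  <- exp(rnorm(n_genes, mean = 7, sd = 1))
raw <- mu * exp(matrix(rnorm(n_genes * n_arrays, sd = 0.2), n_genes, n_arrays))

# Rank correlation between row-wise means and row-wise standard deviations.
mean_sd_cor <- function(x) {
  cor(rowMeans(x), apply(x, 1, sd), method = "spearman")
}

cor_raw <- mean_sd_cor(raw)        # sd grows with the level
cor_log <- mean_sd_cor(log2(raw))  # sd roughly independent of the level

print(c(raw = cor_raw, log2 = cor_log))
```

On the raw scale, selecting the most variable genes would largely amount to selecting the most highly expressed ones; after the log transformation this confounding mostly disappears in the simulated data.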
Exercise 1
a) Comment on the plot. Do you think that the relationship between mean and standard deviation
is sufficiently weak?
b) Have a look at the manual page of meanSdPlot. What is the use of the ranks parameter?
Presuming that we decide that the relationship is not very strong, we proceed. Our next step is
to select some proportion of the genes with a relatively large spread. Let’s say the top 20%.
> sds = esApply(ALL, 1, sd)
> sel = (sds > quantile(sds, 0.8))
> ALLset1 = ALLBCRNEG[sel, ]
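The same filter can be tried without any Bioconductor packages. The sketch below applies it to a plain simulated matrix (hypothetical data, 1000 genes by 12 arrays), which makes it easy to check that no phenotype information enters the selection and that about one fifth of the rows are kept.

```r
set.seed(42)
x <- matrix(rnorm(1000 * 12), nrow = 1000)  # hypothetical expression matrix

sds <- apply(x, 1, sd)            # row-wise standard deviations
sel <- sds > quantile(sds, 0.8)   # TRUE for the 20% most variable rows
x_filtered <- x[sel, , drop = FALSE]

print(dim(x_filtered))
```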
A potential drawback of this approach is found in situations where we are interested in a phenotype
that has relatively few members. In this case, a gene which is differentially expressed between that
group and the other(s) may not have a large overall standard deviation. How would you address
this situation?
At this point you may want to try and look at some heatmaps of the data to see if there are any
obvious patterns. Consult the manual page of the function by typing: ?heatmap
3 Differential Expression
In Bioconductor, the genefilter package allows you to easily select genes using a variety of filters. Additionally, for some tests and comparisons we have developed fast versions. These include rowttests,
which performs a t-test for every row in a gene expression matrix, rowFtests, which does F-tests,
and rowQ, which calculates a quantile for each row.
We have selected a subset of the data with two prominent phenotypes, BCR/ABL and NEG, and will
now consider some different methods for finding differentially expressed genes. The first, and perhaps
the easiest, is to use a t-test.
> library("genefilter")
> tt = rowttests(ALLset1, "mol")
> names(tt)
[1] "statistic" "dm"        "df"        "p.value"
Consult the manual page for rowttests for the meaning of the four different elements of the return
value tt.
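As an aid to reading that manual page, here is a base R sketch of a vectorised row-wise t-test. It assumes the equal-variance two-sample t-test and reports the difference in means as first factor level minus second; both are assumptions of this sketch, and the rowttests documentation remains the authority on what the package actually computes. The helper name row_ttests is hypothetical.

```r
# Equal-variance two-sample t-test for every row of a matrix (sketch).
row_ttests <- function(x, group) {
  g1 <- x[, group == levels(group)[1], drop = FALSE]
  g2 <- x[, group == levels(group)[2], drop = FALSE]
  n1 <- ncol(g1)
  n2 <- ncol(g2)
  dm <- rowMeans(g1) - rowMeans(g2)            # difference of group means
  ss1 <- rowSums((g1 - rowMeans(g1))^2)        # within-group sums of squares
  ss2 <- rowSums((g2 - rowMeans(g2))^2)
  df <- n1 + n2 - 2
  s2 <- (ss1 + ss2) / df                       # pooled variance
  statistic <- dm / sqrt(s2 * (1 / n1 + 1 / n2))
  p.value <- 2 * pt(-abs(statistic), df)
  data.frame(statistic = statistic, dm = dm, df = df, p.value = p.value)
}

# Usage on simulated data, checked against t.test() for the first row.
set.seed(1)
x <- matrix(rnorm(50 * 8), nrow = 50)
group <- factor(rep(c("A", "B"), each = 4))
res <- row_ttests(x, group)
ref <- t.test(x[1, group == "A"], x[1, group == "B"], var.equal = TRUE)
print(c(sketch = res$p.value[1], t.test = ref$p.value))
```

Working on whole matrices with rowMeans and rowSums avoids calling t.test once per gene, which is the main reason such row-wise versions are fast.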
Many practitioners have learned that small p-values do not always correspond to genes for which
there have been large changes. Let us look at the so-called volcano plot.
> plot(tt$statistic, abs(tt$dm), pch = ".", xlab = "t-statistic",
+     ylab = "absolute value of mean fold change")
The result is shown in Figure 2.
Exercise 2
Determine whether there is a small set of probes that correspond to differentially expressed genes,
using the t-test results.
4 Multiple Testing
One of the subject areas that has received a great deal of attention is that of multiple testing. We
provide a brief introduction to the functionality in the multtest package.
Many of the algorithms in the multtest package depend on random permutations of the samples. The
number of permutations is controlled by the parameter B. For the purpose of these exercises, since
the permutations may take a long time, we choose a rather small value of B. In real applications,
you should use one that is 10 or 100 times larger.
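The role of B can be illustrated with a permutation test for a single gene. This is a minimal base R sketch on simulated data, not the algorithm implemented in multtest: the class labels are shuffled B times, and the p-value is the fraction of permutations whose absolute t-statistic is at least as large as the observed one.

```r
set.seed(1)
cl <- rep(0:1, each = 10)      # hypothetical class labels
x  <- rnorm(20) + 1.5 * cl     # one simulated gene, shifted in class 1

# Equal-variance t-statistic for a given labelling.
t_stat <- function(x, cl) {
  unname(t.test(x[cl == 1], x[cl == 0], var.equal = TRUE)$statistic)
}

B <- 1000
t_obs  <- t_stat(x, cl)
t_perm <- replicate(B, t_stat(x, sample(cl)))

# Two-sided permutation p-value; it can only take the values 0, 1/B, 2/B, ...
p_perm <- mean(abs(t_perm) >= abs(t_obs))
print(p_perm)
```

Because the p-value is a multiple of 1/B, a small B limits how small a p-value can be resolved, which is why larger values are advisable in real applications.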
Figure 2: Volcano plot. The x-axis shows the t-statistic, the y-axis the absolute value of mean fold change.
> library("multtest")
> cl = as.numeric(ALLset1$mol) - 1
> resT = mt.maxT(exprs(ALLset1), classlabel = cl, B = 1000)
> ord = order(resT$index)
> rawp = resT$rawp[ord]
The next figure shows the histogram of unadjusted permutation p-values, as given by the vector
rawp. The high proportion of small p–values suggests that a substantial fraction of the genes are
differentially expressed between the two groups.
> hist(rawp, breaks = 50, col = "#B2DF8A")
The result is shown in Figure 3.
Figure 3: Histogram of rawp, the unadjusted p-values.
In order to control the family-wise error rate (FWER), that is, the probability of at least one false
positive in the set of significant genes, we have used the permutation-based maxT-procedure of
Westfall and Young (?), as implemented in the function mt.maxT. We obtain 34 genes with an
adjusted p-value below 0.05:
> sum(resT$adjp < 0.05)
[1] 34
A comparison of this number to the height of the leftmost bar in the histogram suggests that we
may be missing a large number of differentially expressed genes. The FWER is a very stringent
criterion, and in some microarray studies, only a few genes may be significant in this sense, even
if many more are truly differentially expressed. A more liberal criterion is provided by the false
discovery rate (FDR), that is, the expected proportion of false positives among the genes that are
called significant. We can use the procedure of Benjamini and Hochberg (?), as implemented in
multtest, to control the FDR (note, however, that this procedure makes certain assumptions on the
dependence structure between genes):
> res = mt.rawp2adjp(rawp, proc = "BH")
> sum(res$adjp[, "BH"] < 0.05)
[1] 220
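The difference between FWER and FDR control can also be explored with base R's p.adjust, which offers both a Bonferroni adjustment (FWER) and the "BH" adjustment (FDR). The ten p-values below are hypothetical, and the manual computation is included only to show what the BH adjustment does: each ordered p-value is multiplied by m / rank, and the results are then made monotone starting from the largest p-value.

```r
# Ten hypothetical unadjusted p-values.
rawp <- c(0.0001, 0.0004, 0.0019, 0.0095, 0.0201,
          0.0278, 0.0298, 0.0344, 0.0459, 0.3240)
m <- length(rawp)

bonf <- p.adjust(rawp, method = "bonferroni")  # controls the FWER
bh   <- p.adjust(rawp, method = "BH")          # controls the FDR

# BH adjustment by hand.
o <- order(rawp)
bh_manual <- numeric(m)
bh_manual[o] <- pmin(1, rev(cummin(rev(rawp[o] * m / seq_len(m)))))

print(c(bonferroni = sum(bonf < 0.05), BH = sum(bh < 0.05)))
```

As in the ALL example, the FDR criterion calls noticeably more genes significant than the FWER criterion at the same nominal level.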
5 limma
A t-test analysis can also be conducted with functions of the limma package. First, we have to
define the design matrix. One possibility is to use an intercept term (first column consisting of 1s)
and to encode the difference between the two classes in the second column. With the 0/1 coding
used below, the intercept represents the mean log intensity of a gene in the samples coded 0 (here
NEG), and the second coefficient is the difference between the BCR/ABL and NEG class means.
> library("limma")
> labels = as.numeric(ALLset1$mol == "BCR/ABL")
> design = cbind(mean = 1, diff = labels)
> design
Next a linear model is fitted for every gene by the function lmFit, and Empirical Bayes moderation
of the standard errors is done by the function eBayes.
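What the two columns of such a design matrix estimate can be checked for a single gene with base R's lm. The sketch below uses simulated values and a hypothetical 0/1 class indicator; it covers only the ordinary least-squares step, not the Empirical Bayes moderation.

```r
set.seed(7)
labels <- rep(c(0, 1), times = c(5, 4))               # hypothetical class indicator
y <- rnorm(length(labels), mean = 8) + 1.2 * labels   # one simulated gene
design <- cbind(mean = 1, diff = labels)

# Fit without an extra intercept: the design already has a column of 1s.
fit1 <- lm(y ~ 0 + design)
est  <- unname(coef(fit1))

# First coefficient: mean of the class coded 0.
# Second coefficient: difference between the two class means.
m0 <- mean(y[labels == 0])
m1 <- mean(y[labels == 1])
print(rbind(from_lm = est, from_means = c(m0, m1 - m0)))

# The t-statistic of the second coefficient equals the equal-variance t-test.
t_lm  <- summary(fit1)$coefficients[2, "t value"]
t_ref <- unname(t.test(y[labels == 1], y[labels == 0], var.equal = TRUE)$statistic)
```

With this coding the coefficient of interest for differential expression is the second one, which is why coef = "diff" is the natural choice in the calls that follow.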
> fit = lmFit(ALLset1, design)
> fit = eBayes(fit)
> topTable(fit, coef = "diff", adjust.method = "fdr")
            ID        M        A        t      P.Value         B
156  1636_g_at 1.100012 9.196420 9.033955 1.232012e-10 21.293157
1915  39730_at 1.152527 9.000049 8.587774 4.894886e-10 19.341283
155    1635_at 1.202675 7.897095 7.338622 1.034346e-07 13.905543
163    1674_at 1.427212 5.001771 7.050134 2.874800e-07 12.667763
2066  40504_at 1.181029 4.244478 6.664739 1.298466e-06 11.032194
2014  40202_at 1.779378 8.621443 6.391779 3.628723e-06  9.889235
1262  37015_at 1.032702 4.330511 6.242224 5.996667e-06  9.269402
437   32434_at 1.678550 4.466311 5.971866 1.696739e-05  8.161945
1269  37027_at 1.348702 8.444161 5.805421 3.077880e-05  7.489382
1366  37403_at 1.117721 5.086540 5.483354 1.077006e-04  6.210589
When you compare the resulting p-values with those from the parametric t-test, you will see that
they are almost identical:
> plot(-log10(tt$p.value), -log10(fit$p.value[, "diff"]), xlab = "-log10(p) from two-sample t-test",
+     ylab = "-log10(p) from moderated t-test (limma)", pch = ".")
> abline(c(0, 1), col = "red")
The result is shown in Figure 4.
Figure 4: Comparison of p-values from the unmoderated and moderated t-test in a situation with
a large number of samples in both groups.
Because of the large number of samples, the Empirical Bayes moderation is not so relevant here:
in this data set the gene-specific variance can be estimated well from the data of each gene.
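Why the number of samples matters can be seen from the form of the moderated variance, which is a weighted average of a prior variance and the gene's own sample variance, with the degrees of freedom as weights. In the sketch below the prior values d0 and s0_2 are hypothetical placeholders (limma estimates them from all genes), and the residual degrees of freedom are illustrative.

```r
# Moderated (posterior) variance: weighted average of prior and sample variance.
moderate_var <- function(s2, d, s0_2, d0) {
  (d0 * s0_2 + d * s2) / (d0 + d)
}

s0_2 <- 0.05   # hypothetical prior variance
d0   <- 4      # hypothetical prior degrees of freedom
s2   <- 0.001  # a gene with an unusually small sample variance

v_small <- moderate_var(s2, d = 4,  s0_2 = s0_2, d0 = d0)  # 3 + 3 arrays
v_large <- moderate_var(s2, d = 70, s0_2 = s0_2, d0 = d0)  # many arrays

# Factor by which the ordinary t-statistic shrinks under moderation.
shrink <- sqrt(s2 / c(small = v_small, large = v_large))
print(rbind(variance = c(v_small, v_large), t_factor = shrink))
```

With few arrays the prior dominates and a suspiciously small variance is pulled strongly towards the typical value; with many arrays the gene's own variance estimate carries most of the weight.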
5.1 Small sample sizes
However, the Empirical Bayes moderation may be quite useful in cases with fewer replicates. Let
us draw a subsample with 3 arrays from each group from our data:
> subs = c(35, 65, 75, 1, 69, 71)
> ALLset2 = ALLBCRNEG[, subs]
> table(ALLset2$mol)
BCR/ABL     NEG
      3       3
We repeat the testing procedure in the same way as before,
> tt2 = rowttests(ALLset2, "mol")
> fit2 = eBayes(lmFit(ALLset2, design = design[subs, ]))
and plot the results in Figure 5.
> plot(-log10(tt2$p.value), -log10(fit2$p.value[, "diff"]), xlab = "-log10(p) from two-sample t-test",
+     ylab = "-log10(p) from moderated t-test (limma)", pch = ".")
> abline(c(0, 1), col = "red")
Figure 5: Comparison of p-values from the unmoderated and moderated t-test in a situation with
a small number of samples in both groups.
Let us have a look at a gene which has a small p-value in the ordinary t–test but a large one in the
moderated test:
> g = which(tt2$p.value < 1e-04 & fit2$p.value[, "diff"] > 0.02)
> plot(exprs(ALLset2)[g, ], pch = 16, col = as.numeric(ALLset2$mol))
The plot is shown in Figure 6.
[Plot: exprs(ALLset2)[g, ] on the y-axis (about 5.6 to 5.9) against Index (1 to 6) on the x-axis.]
Figure 6: Expression values of 36404at. This probe set has a highly significant p–value in the
unmoderated t–test, but is unremarkable in the moderated test. Note the y–axis scaling: while
the two groups look well-separated, the absolute difference is small. Most likely, this is a chance
artifact. You can verify this by looking at the expression values of this probe set in the other
samples that were not used here.
6 ROC
The receiver operating characteristic (ROC) curve is a popular method for finding variables that
discriminate between two groups (?). The package ROC can be used to compute these curves. In this
section we look at some exercises using this package.
We want to find marker genes that are specifically expressed in leukemias with the BCR/ABL
translocation. At a specificity of at least 0.9, we would like to identify the genes with the best
sensitivity for the BCR/ABL phenotype. This can be expressed by the partial area under the ROC
curve (pAUC, we choose t0 = 0.1). To limit the computation time, we compute the pAUC statistic
only for the first 100 probe sets.
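To fix ideas, here is a base R sketch of a partial AUC for one marker. It is a simplified stand-in written for this lab text, not the implementation in the ROC package: it uses the rule "marker > threshold", joins the ROC points linearly, and integrates sensitivity over 1 − specificity from 0 to t0. The function name partial_auc and the toy data are hypothetical.

```r
# Partial area under the ROC curve for 1 - specificity in [0, t0] (sketch).
partial_auc <- function(marker, truth, t0 = 0.1) {
  pos <- marker[truth == 1]
  neg <- marker[truth == 0]
  # Candidate thresholds from strict to lenient; a case is called 1 if marker > threshold.
  thr <- c(sort(unique(marker), decreasing = TRUE), -Inf)
  tpr <- sapply(thr, function(th) mean(pos > th))  # sensitivity
  fpr <- sapply(thr, function(th) mean(neg > th))  # 1 - specificity
  area <- 0
  for (k in seq_len(length(thr) - 1)) {
    x0 <- fpr[k]
    x1 <- fpr[k + 1]
    if (x1 <= x0 || x0 >= t0) next   # vertical step, or already beyond t0
    xr <- min(x1, t0)                # clip the segment at t0
    y0 <- tpr[k]
    yr <- y0 + (tpr[k + 1] - y0) * (xr - x0) / (x1 - x0)
    area <- area + (xr - x0) * (y0 + yr) / 2
  }
  area
}

truth    <- rep(0:1, each = 10)
perfect  <- c(1:10, 11:20)   # every case of class 1 exceeds every case of class 0
reversed <- c(11:20, 1:10)   # the opposite ordering

print(c(perfect  = partial_auc(perfect, truth),
        reversed = partial_auc(reversed, truth)))
```

A perfectly separating marker attains the maximum possible value t0, so with t0 = 0.1 the statistic lies between 0 and 0.1.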
> library("ROC")
> dxrule.sca
function (x, thresh)
ifelse(x > thresh, 1, 0)
dxrule.sca is an example of a simple one-marker classification rule: if the value of the marker is
greater than a threshold, the case is assigned to group 1, otherwise to group 0. We now define a
function myRule that encapsulates this rule in the way that it is used by the helper functions of
the ROC package.
> myRule = function(x) {
+     pAUC(rocdemo.sca(truth = labels, data = x, rule = dxrule.sca),
+         t0 = 0.1)
+ }
> pAUC1s = esApply(ALLset1[1:100, ], 1, myRule)
Next we will select the probe set with the maximal value of our pAUC statistic, and plot the
corresponding ROC curve.
> j = which.max(pAUC1s)
> RC = rocdemo.sca(truth = labels, data = exprs(ALLset1)[j, ],
+     rule = dxrule.sca, caseLabel = "new case", markerLabel = geneNames(ALLset1)[j])
And the figure,
> plot(RC, main = geneNames(ALLset1)[j])
The result is shown in Figure 7.
7 A Bayesian Approach
A different approach to dealing with the multiplicity problem is to adopt a Bayesian perspective.
Kendziorski et al. have produced the EBarrays package that provides some Bayesian modeling
capabilities. The general idea is to estimate the posterior probability that genes are differentially
expressed. We refer the reader to their papers and the vignette in the package for more details.
Here, we concentrate on how to use the software.
> library("EBarrays")
The package requires a slightly peculiar way of specifying the groups of interest:
> pattern = ebPatterns(c("1,1,1,1,1,1", "1,1,1,2,2,2"))
> em = emfit(ALLset2, family = "GG", hypotheses = pattern, num.iter = 10)
> empp = postprob(em, ALLset2)
Exercise 3
Examine a histogram of these posterior probabilities; you may find it helpful to consult the manual
page first. Consider comparing the EBarrays posterior probabilities with the p-values computed
using limma. Those computed by EBarrays correspond to the posterior probability of differential
expression; how would you decide which genes are differentially expressed? Using your criteria,
how would you compare that to the limma results?
Figure 7: ROC curve of a gene that separates well between BCR/ABL positive and negative tumors.
(Panel title: 1326_at; x-axis: 1−spec: 1326_at; y-axis: sens: new case.)
> hist(empp[, "P2"], 100)
> plot(-log10(fit2$p.value[, "diff"]), empp[, "P2"], pch = ".")
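One possible way to turn posterior probabilities into a gene list, offered here only as a starting point for the exercise, is to estimate the false discovery rate of a selection by the average of 1 − posterior probability over the selected genes. The sketch uses ten hypothetical posterior probabilities rather than output from EBarrays.

```r
# Hypothetical posterior probabilities of differential expression.
pp <- c(0.99, 0.97, 0.90, 0.75, 0.60, 0.40, 0.10, 0.05, 0.02, 0.01)

# Option 1: a fixed cutoff, with the implied estimate of the FDR.
sel     <- pp > 0.5
est_fdr <- mean(1 - pp[sel])

# Option 2: the largest list whose estimated FDR stays at or below 5%.
o       <- order(pp, decreasing = TRUE)
cum_fdr <- cumsum(1 - pp[o]) / seq_along(o)
n_sel   <- max(which(cum_fdr <= 0.05))

print(c(n_cutoff = sum(sel), est_fdr = est_fdr, n_fdr05 = n_sel))
```

Lists obtained this way can then be compared with, for example, the genes whose limma p-values pass an FDR adjustment at the same level.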
8 Comparisons
In Sections 3 and 6 you have learned how to find differentially expressed genes using two different
criteria: location changes in the distribution of a gene’s expression and its receiver operating
characteristic. Let us now compare these methods and focus on the influence of sample size.
The code used in this section is rather advanced and needs the update of package genefilter to a
version >= 1.7.7. You don’t necessarily have to reproduce it here, but try to understand what it
does and take a look at the output in Figure 9.
> library(reposTools)
> update.packages2("genefilter", develOK = TRUE)
In your updated version of genefilter you will now find the new function rowpAUCs. It does the
same thing as functions rocdemo.sca and pAUC in package ROC, that is compute ROC curves and
pAUCs, but it was designed to be much faster and memory efficient. It can deal with exprSets
and works through them row-by-row. You may want to look into rowpAUCs’s documentation for
further information.
We want to see how the sample size affects the number of genes that are found to be differentially
expressed by the two methods. For this, we need to divide our data set into random sample subsets
and do our computations on these subsets.
Figure 8: Left: histogram of posterior probabilities for differential expression between the
BCR/ABL positive and negative tumors. Right: Comparison of the posterior probabilities with
the (negative logarithm of the) p-values from the limma analysis.
First we define a function that produces a logical vector out of an exprSet. The length of the vector
is the number of samples, and it is intended to describe a grouping of the samples. (We did a
similar thing in section 5 to get the labels vector. Since we need this in the following steps more
than once, we now make it a function.)
> classlabel <- function(x) {
+     stopifnot(class(x) == "exprSet", "mol.biol" %in% colnames(pData(x)))
+     return(as.numeric(pData(x)$mol.biol == "BCR/ABL"))
+ }
Now let us build wrapper functions around rowttests
> nrsel.ttest = function(x, pthresh = 0.05) {
+     pval = rowttests(x, "mol")$p.value
+     return(sum(pval < pthresh))
+ }
and rowpAUCs.
> nrsel.pAUC = function(x, pAUCthresh = 0.03) {
+     pAUC = rowpAUCs(x, fac = classlabel(x), p = 0.1)$pAUC
+     return(sum(pAUC > pAUCthresh))
+ }
Note that the choices of thresholds are, as always, somewhat arbitrary, and that the one for the
t-test, pthresh, is not directly comparable to the one for rowpAUCs, pAUCthresh.
What we need now is a function that does some resampling for various data set sizes. To further
reduce computation time, we only take a subset of 500 randomly sampled genes from our data set
(you can try using more).
> resample <- function(selfun, groupsize = seq(6, 36, by = 6),
+     nrep = 25) {
+     ALLset3 <- ALLset1[sample(1:nrow(exprs(ALLset1)), 500), ]
+     n <- matrix(nrow = nrep, ncol = length(groupsize))
+     for (i in seq(along = groupsize)) {
+         for (rep in 1:nrep) {
+             samplesubset <- c(sample(which(classlabel(ALLset3) ==
+                 0), groupsize[i]), sample(which(classlabel(ALLset3) ==
+                 1), groupsize[i]))
+             n[rep, i] <- do.call(selfun, args = list(x = ALLset3[,
+                 samplesubset]))
+             cat(".")
+         }
+         cat("\n")
+     }
+     mns <- apply(n, 2, mean)
+     stds <- apply(n, 2, sd)
+     plot(groupsize, mns, pch = 16, col = "#c03030", ylim = c(min(mns -
+         stds) - 5, max(mns + stds) + 5), xlab = "groupsize",
+         ylab = "selected no. of diff. exp. genes", main = selfun)
+     segments(groupsize, mns - stds, groupsize, mns + stds, col = "red")
+ }
Run the function, once with nrsel.ttest and once with nrsel.pAUC.
> par(mfrow = c(1, 2))
> resample("nrsel.ttest")
> resample("nrsel.pAUC")
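The logic of this resampling experiment can also be tried on simulated data with base R only. The sketch below is a scaled-down analogue, not a reproduction of the analysis above: it assumes 200 hypothetical genes, of which the first 20 are shifted by two standard deviations in one class, and counts the genes with a t-test p-value below 0.05 for several group sizes.

```r
set.seed(2005)
n_genes     <- 200
n_de        <- 20
n_per_class <- 36
cl <- rep(0:1, each = n_per_class)

# Simulated expression matrix; the first n_de genes are shifted in class 1.
x <- matrix(rnorm(n_genes * 2 * n_per_class), nrow = n_genes)
x[1:n_de, cl == 1] <- x[1:n_de, cl == 1] + 2

# Equal-variance t-test p-values for all rows at once.
row_pvals <- function(x, cl) {
  a <- x[, cl == 0, drop = FALSE]
  b <- x[, cl == 1, drop = FALSE]
  na <- ncol(a)
  nb <- ncol(b)
  df <- na + nb - 2
  s2 <- (rowSums((a - rowMeans(a))^2) + rowSums((b - rowMeans(b))^2)) / df
  tt <- (rowMeans(b) - rowMeans(a)) / sqrt(s2 * (1 / na + 1 / nb))
  2 * pt(-abs(tt), df)
}

groupsize <- c(3, 6, 12)
nrep <- 5
n_sel <- matrix(NA_real_, nrow = nrep, ncol = length(groupsize),
                dimnames = list(NULL, paste0("n=", groupsize)))
for (i in seq_along(groupsize)) {
  for (r in seq_len(nrep)) {
    # Draw groupsize[i] arrays from each class without replacement.
    sub <- c(sample(which(cl == 0), groupsize[i]),
             sample(which(cl == 1), groupsize[i]))
    n_sel[r, i] <- sum(row_pvals(x[, sub], cl[sub]) < 0.05)
  }
}
print(colMeans(n_sel))
```

Because the truth is known in the simulation, the counts can be split into true and false positives, which is not possible with the real data.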
Exercise 4
Which of the two criteria seems more reasonable?
In this vignette you have seen a number of different ways to determine interesting genes - or genes
that are differentially expressed. Compare some of the lists - what do you think about the amount
of overlap? Do you have some idea how you might decide whether one list is better than some of
the others?
Another way to consider the different methods is to ask what patterns of expression are selected
by a method and which ones are not. For example, is one method more affected by outliers than
another? Or how is it affected by the shape of the distribution?
Figure 9: Comparing number of differentially expressed genes for different sample sizes. Left: t-test
criterion. Right: pAUC criterion. (Panel titles: nrsel.ttest and nrsel.pAUC; x-axis: groupsize;
y-axis: selected no. of diff. exp. genes.)
The version number of R and packages loaded for generating this document are:
R version 2.1.0, 2005-04-28, i686-pc-linux-gnu
attached base packages:
[1] "splines"   "tools"     "methods"   "stats"     "graphics"  "grDevices"
[7] "utils"     "datasets"  "base"
other attached packages:
  EBarrays    lattice        ROC      limma   multtest genefilter   survival
  "1.0.19"   "0.11-6"    "1.1.0"   "1.8.16"    "1.6.0"    "1.6.0"     "2.17"
       vsn        ALL    Biobase
   "1.6.0"    "1.0.2"    "1.5.8"