Tuning Tutorial for a binary trait

advertisement
Tutorial for the randomGLM R package:
Parameter tuning based on OOB prediction
Lin Song, Steve Horvath
In this tutorial, we show how to fine-tune RGLM parameters based on out-of-bag (OOB) prediction
performance. Similar to a cross validation estimate, the OOB prediction of the accuracy is nearly
unbiased, making it a good criterion for judging how parameter choices affect the prediction accuracy.
Data:
We use the small, round blue cell tumors (srbct) data set [1,2] as an example training data set.
The data can be found on our webpage at http://labs.genetics.ucla.edu/horvath/RGLM.
It is composed of the gene expression profiling of 2308 genes across 63 observations.
1. Data preparation
# load required package
library(randomGLM)
# download data from webpage and load it.
# Importantly, change the path to the data and use back slashes /
setwd("C:/Users/Horvath/Documents/CreateWebpage/RGLM/Tutorials/")
load("srbct.rda")
# check data
dim(srbct$x)
table(srbct$y)
Output:
1
2
40 23
Message: Binary outcome that is poorly balanced since 40/(40+23)=64% of
observations are class 1.
x = srbct$x
y = srbct$y
# number of features
N = ncol(x)
If there is a large number of observations in the training set, for example >2000, the computation of
RGLM will become very time consuming. We suggest doing parameter tuning with random subset of
observations. Example code is as follows:
1
nObsSubset=500
subsetObservations=sample(1:dim(x)[[1]], nObsSubset,replace=FALSE)
x.Subset=x[subsetObservations,]
y.Subset=y[subsetObservations]
2. Define accuracy measures
Many measures can be used to evaluate prediction accuracy, such as proportion classified correctly,
sensitivity, specificity, area under ROC and C index [3]. Users can define a measure according to their
needs. Here we simply use proportion classified correctly. An illustration of C index is also provided at
the end of this tutorial.
accuracyM=function(tab){
num1=sum(diag(tab))
denom1=sum(tab)
signif(num1/denom1,3)
}
3. Should one include pairwise interactions between the features?
RGLM allows interaction terms added among features in GLM construction through parameter
“maxInteractionOrder”, which has a big effect on prediction performance. Genrally, we do not
recommend 3rd or higher order interactions, because of the computation burden and little performance
improvement. For high dimensional data such as the srbct data set, we recommend using no
interaction at all, because we already have enough features to produce instability in GLMs. But here
for illustration purpose, we compare “no interaction” to “2nd interaction”. Note that we use only 50
bags instead of default 100 in the following to facilitate parameter tuning.
# Fit the RGLM with 50 bags but we recommend you use 100 bags
RGLM = randomGLM(x, y,classify=TRUE,nBags=50,keepModels=TRUE)
# accuracy
accuracyM(table(RGLM$predictedOOB, y))
# [1] 1
# RGLM with pairwise interaction between features
# The following calculation could take up to 50 minutes using a single core (i.e.
nThreads=1). Parallel running is highly recommended, which is implemented by
parameter nThreads. As you can learn from the help file or randomGLM,
If nThreads is left at its default value (i.e. NULL), it will be determined
automatically as the number of available cores if the latter is 3 or less, and
number of cores minus 1 if the number of available cores is 4 or more.
RGLM.inter2=randomGLM(x, y, classify=TRUE, maxInteractionOrder=2,
2
nBags=50,keepModels=TRUE,nThreads=NULL)
# accuracy
accuracyM(table(RGLM.inter2$predictedOOB, y))
# [1] 0.9841
As expected, adding pairwise interaction terms decreases the OOB accuracy from 1 to 0.984.
Therefore, it is not worth the trouble to consider interaction here. In general, we do not advise pairwise
interaction terms when dealing with gene expression data or other high dimensional genomic data.
4. Feature selection parameters tuning (focusing on one parameter at a time)
There are two major feature selection parameters that affect the performance of RGLM:
nFeaturesInBag and nCandidateCovariates. nFeaturesInBag controls the number of features randomly
selected in each bag (random subspace). nCandidateCovariates controls the number of covariates used
in GLM model selection. Now we show how to sequentially tune these 2 parameters.
# number of features
N = ncol(x)
# choose nFeaturesInBag among the following possible values
nFeatureInBagVector=ceiling(c(0.1, 0.2, 0.4, 0.6, 0.8, 1)*N)
# define vector that saves prediction accuracies
acc = rep(NA, length(nFeatureInBagVector))
# loop over nFeaturesInBag values, calculate individual accuracy
for (i in 1:length(nFeatureInBagVector))
{
cat("step", i, "out of ", length(nFeatureInBagVector), "entries from
nFeatureInBagVector\n")
RGLMtmp = randomGLM(x,y,classify=TRUE,
nFeaturesInBag = nFeatureInBagVector[i],nBags=50,keepModels=TRUE)
predicted = RGLMtmp$predictedOOB
acc[i] = accuracyM(table(predicted, y))
rm(RGLMtmp, predicted)
}
data.frame(nFeatureInBagVector,acc)
#
nFeatureInBagVector
acc
#1
231 0.9683
#2
462 1.0000
#3
924 0.9841
#4
1385 0.9683
#5
1847 0.9841
#6
2308 0.9841
# Accuracy is highest when nFeatureInBag takes 20% of all features.
# view by plot, choose nFeaturesInBag = 462.
pdf("~/Desktop/gene_screening/package/nFeaturesInBag.pdf",5,5)
plot(nFeatureInBagVector,acc,ylab="OOB
3
accuracy",xlab="nFeatureInBag",main="Choosing nFeatureInBag",type="l")
text(nFeatureInBagVector,acc,lab= nFeatureInBagVector)
dev.off()
Message: any of the considered choices leads to a nearly perfect OOB estimate of the accuracy.
Here nFeaturesInBag equal to 20% of all features (resulting in 462 features) results in the highest
OOB prediction accuracy. Note that this optimized value is the same as the default choice of
nFeaturesInBag.
How to choose nCandidateCovariates?
In the following, we assume that nCandidateCovariates has been set to its default value, i.e.
nFeaturesInBag=ceiling(0.2*N).
# We consider the following choices for nCandidateCovariates
nCandidateCovariatesVector=c(5,10,20,30,50,75,100)
# define a vector that stores the resulting OOB estimates of the accuracies
acc1 = rep(NA, length(nCandidateCovariatesVector))
# loop over nCandidateCovariates values, calculate individual accuracy
for (j in 1:length(nCandidateCovariatesVector))
{
cat("step", j, "out of ", length(nCandidateCovariatesVector), "entries from
nCandidateCovariatesVector\n")
RGLMtmp=randomGLM(x,y,classify=TRUE, nFeaturesInBag = ceiling(0.2*N),
nCandidateCovariates=nCandidateCovariatesVector[j],nBags=50,keepModels=T
RUE)
predicted = RGLMtmp$predictedOOB
acc1[j] = accuracyM(table(predicted, y))
4
rm(RGLMtmp, predicted)
}
data.frame(nCandidateCovariatesVector,acc1)
#
nCandidateCovariatesVector acc1
#1
5
1
#2
10
1
#3
20
1
#4
30
1
#5
50
1
#6
75
1
#7
100
1
# All accuracies are equal to 1, suggesting that nCandidateCovariates does not
have a significant effect on the prediction accuracy in this data set.
#
Here
is
a
plot
that
shows
how
the
accuracy
(y-axis)
depends
on
nCandidateCovariates (x-axis).
pdf("~/Desktop/gene_screening/package/nCandidateCovariates.pdf",5,5)
plot(nCandidateCovariatesVector,acc1,ylab="OOB
accuracy",xlab="nCandidateCovariates",main="Choosing
nCandidateCovariates",type="l")
text(nCandidateCovariatesVector,acc1,lab= nCandidateCovariatesVector)
dev.off()
All accuracies are equal to 1, suggesting that nCandidateCovariates do not have a significant effect on
prediction in this data set. In this case, nCandidateCovariates=5 is preferred because of computational
reasons. Recall that the computation time greatly depends on nCandidateCovariates.
5
5. Optimzing the parameter values by varying both at the same time.
Previously, we tuned (chose) nFeaturesInBag and nCandidateCovariates one at a time. But clearly it is
preferable to consider both parameter choices at the same time, i.e. to see how the accuracy changes
over a grid of possible values.
# choose nFeaturesInBag and nCandidateCovariates at the same time
nFeatureInBagVector=ceiling(c(0.1, 0.2, 0.4, 0.6, 0.8, 1)*N)
nCandidateCovariatesVector=c(5,10,20,30,50,75,100)
#Note
that
nFeatureInBagVector
will
be
the
rows
of
the
grid
and
#nCandidateCovariatesVector will be the columns of the grid. For each grid value,
#we calculate the OOB estimate of the accuracy and store it in the following
matrix.
acc2=matrix(NA,length(nFeatureInBagVector),
length(nCandidateCovariatesVector))
rownames(acc2) = paste("feature", nFeatureInBagVector, sep="")
colnames(acc2) = paste("cov", nCandidateCovariatesVector, sep="")
# loop over nFeaturesInBag and nCandidateCovariates values, and calculate the
resulting OOB estimate of the accuracy.
for (i in 1:length(nFeatureInBagVector))
{
cat("step", i, "out of ", length(nFeatureInBagVector), "entries from
nFeatureInBagVector\n")
for (j in 1:length(nCandidateCovariatesVector))
{
cat("step", j, "out of ", length(nCandidateCovariatesVector), "entries from
nCandidateCovariatesVector\n")
RGLMtmp=randomGLM(x,y,classify=TRUE,nFeaturesInBag=nFeatureInBagVector[i],
nCandidateCovariates=nCandidateCovariatesVector[j],nBags=50,keepModels=TRUE
)
predicted = RGLMtmp$predictedOOB
acc2[i, j] = accuracyM(table(predicted, y))
rm(RGLMtmp, predicted)
}
}
acc2
#
cov5 cov10
cov20 cov30 cov50 cov75 cov100
#feature231 0.9841 0.9365 0.9365 0.9683 0.9683 0.9683 0.9683
#feature462 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
#feature924 0.9841 0.9841 0.9841 0.9841 0.9841 0.9841 0.9841
#feature1385 0.9683 0.9683 0.9683 0.9683 0.9683 0.9683 0.9683
#feature1847 0.9841 0.9841 0.9841 0.9841 0.9841 0.9841 0.9841
#feature2308 0.9841 0.9841 0.9841 0.9841 0.9841 0.9841 0.9841
6
# Create a heatmap plot visualizes the values in the above matrix
# load required library
library(WGCNA)
# the following is not important for generating a small heat map
allowWGCNAThreads()
# comment: If you get an error message, please ignore it.
# disableWGCNAThreads()
pdf("~/Desktop/gene_screening/package/parameterChoice.pdf")
par(mar=c(2, 5, 4, 2))
labeledHeatmap(Matrix = acc2,yLabels = rownames(acc2),xLabels = colnames(acc2),
colors = greenWhiteRed(100)[51:100],textMatrix = acc2, setStdMargins = FALSE,
xLabelsAngle=0,xLabelsAdj = 0.5,main = "Parameter choice")
dev.off()
Message: There is a very strong signal in the data set: any parameter combination between
nFeatureInBagVector (rows) and nCandidateCovariatesVector (columns) leads to a nearly perfect
(OOB estimate of the) accuracy. For example, choosing nFeatureInBagVector=462 leads to an
accuracy of 1 if nCandidateCovariatesVector lies between 5 and 100.
6. Effects of number of bags on the OOB prediction accuracy
In RGLM, the default number of bags is set to 100. But for previous parameter tuning, we use only 50
bags to save time. In the following we show how the number of bags in RGLM affects the OOB
prediction accuracy.
# choose nBags
# consider 5 values
7
nBagsVector=c(10, 20, 50, 100, 200)
# define vector that saves prediction accuracies
acc3 = rep(NA, length(nBagsVector))
# loop over nBags values, calculate individual accuracy
for (k in 1:length(nBagsVector))
{
print(k)
RGLMtmp = randomGLM(x, y,
classify=TRUE,
nBags=nBagsVector[k],
keepModels=TRUE)
predicted = RGLMtmp$predictedOOB
acc3[k] = accuracyM(table(predicted, y))
rm(RGLMtmp, predicted)
}
data.frame(nBagsVector,acc3)
#
nBagsVector
acc3
#1
10 0.9672
#2
20 0.9683
#3
50 1.0000
#4
100 0.9841
#5
200 1.0000
pdf("~/Desktop/gene_screening/package/nBags.pdf",5,5)
plot(nBagsVector,acc3,ylab="OOB accuracy",xlab="nBags",main="Choosing
nBags",type="l")
text(nBagsVector,acc3,lab= nBagsVector)
dev.off()
We can see that OOB accuracy increases in the first few dozens of bags. 50 bags is enough for
8
parameter tuning.
7. RGLM parameter tuning based on C index
As mentioned previously, users can define their own accuracy measures to tune RGLM parameters.
Here we briefly illustrate how to choose parameters based on Harrell’s C index [3].
# define accuracy measure
library(Hmisc)
Cindex=function(classProbabilities, y) { rcorr.cens(classProbabilities[,2], y,
outx=TRUE)[1]}
accuracyMClassProb=Cindex
# choose nFeaturesInBag and nCandidateCovariates at the same time
nFeatureInBagVector=ceiling(c(0.1, 0.2, 0.4, 0.6, 0.8, 1)*N)
nCandidateCovariatesVector=c(5,10,20,30,50,75,100)
# define vector that saves prediction accuracies
acc4 = matrix(NA, length(nFeatureInBagVector),
length(nCandidateCovariatesVector))
rownames(acc4) = paste("feature", nFeatureInBagVector, sep="")
colnames(acc4) = paste("cov", nCandidateCovariatesVector, sep="")
# loop over nFeaturesInBag and nCandidateCovariates values, calculate individual
accuracy
for (i in 1:length(nFeatureInBagVector))
{
cat("step",
i,
"out
of
",
length(nFeatureInBagVector),
"entries
from
nFeatureInBagVector\n")
for (j in 1:length(nCandidateCovariatesVector))
{
cat("step", j, "out of ", length(nCandidateCovariatesVector), "entries from
nCandidateCovariatesVector\n")
RGLMtmp = randomGLM(x, y,
classify=TRUE,
nFeaturesInBag = nFeatureInBagVector[i],
nCandidateCovariates = nCandidateCovariatesVector[j],
nBags=50,
keepModels=TRUE)
predictedProb = RGLMtmp$predictedOOB.response
acc4[i, j] = accuracyMClassProb(predictedProb, y)
rm(RGLMtmp, predictedProb)
}
}
acc4
9
#
cov5
cov10
cov20
cov30
cov50
cov75
cov100
#feature231
1.0000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
#feature462
1.0000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
#feature924
1.0000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
#feature1385 0.9989130 0.998913 0.998913 0.998913 1.000000 1.000000 1.000000
#feature1847 1.0000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
#feature2308 0.9978261 0.998913 0.998913 0.998913 0.998913 0.998913 0.998913
# view by plot
# load required library
library(WGCNA)
pdf("~/Desktop/gene_screening/package/parameterChoiceCindex.pdf")
par(mar=c(2, 5, 4, 2))
labeledHeatmap(
Matrix = acc4,
yLabels = rownames(acc4),
xLabels = colnames(acc4),
colors = greenWhiteRed(100)[51:100],
textMatrix = signif(acc4,3),
setStdMargins = FALSE,
xLabelsAngle=0,
xLabelsAdj = 0.5,
main = "Parameter choice")
dev.off()
The message is the same as the analysis using proportion classified correctly. There is a very strong
signal in the data set: any parameter combination between nFeatureInBagVector (rows) and
10
nCandidateCovariatesVector (columns) leads to a nearly perfect (OOB estimate of the) accuracy.
References
1. Khan J,Wei JS, Ringner M, Saal LH, Ladanyi M,Westermann F, Berthold F, Schwab M, Antonescu
CR, Peterson C, Meltzer PS: Classification and diagnostic prediction of cancers using gene expression
profiling
and
artificial
neural
networks.
Nature
Medicine
2001,
7(6):673–679,
[http://dx.doi.org/10.1038/89044].
2. Song L, Langfelder P, Horvath S (2013) Random generalized linear model: a highly accurate and
interpretable
ensemble
predictor.
BMC
Bioinformatics
14:5
PMID:
23323760DOI:
10.1186/1471-2105-14-5.
3. Harrell Jr, F. E., Califf, R. M., Pryor, D. B., Lee, K. L., & Rosati, R. A. (1982). Evaluating the
yield of medical tests. JAMA: the journal of the American Medical Association, 247(18),
2543-2546.
11
Download