Tuning Tutorial for a quantitative trait

Tutorial for the randomGLM R package:
Parameter tuning based on OOB prediction
for quantitative outcomes
Lin Song, Steve Horvath
In this tutorial, we show how to fine-tune RGLM parameters based on out-of-bag (OOB) prediction
performance for quantitative outcome prediction. Like a cross-validation estimate, the OOB estimate of
prediction accuracy is nearly unbiased, which makes it a good criterion for judging how parameter
choices affect the prediction accuracy.
Data:
We use mouse adipose tissue gene expression data from the lab of Jake Lusis (Ghazalpour et
al. 2006 [2]). The expression data comprise 5000 genes (features) measured across n=239 mice
(observations). Here we aim to predict y=mouse length (cm) based on the 5000 genes. The data can be
found on our webpage at http://labs.genetics.ucla.edu/horvath/RGLM. Details of this data set are
explained in [1].
1. Data preparation
# load required package
library(randomGLM)
# download data from webpage and load it.
# Importantly, change the path to the data and use forward slashes /
setwd("C:/Users/Horvath/Documents/CreateWebpage/RGLM/Tutorials/")
load("mouse.rda")
# check data
dim(mouse$x)
summary(mouse$y)
x = mouse$x
y = mouse$y
N = ncol(x)
If there is a large number of observations in the training set (for example, more than 2000), the
computation of RGLM becomes very time consuming. We suggest doing parameter tuning with a
random subset of observations. Example code is as follows:
nObsSubset=500
subsetObservations=sample(1:dim(x)[[1]], nObsSubset,replace=FALSE)
x.Subset=x[subsetObservations,]
y.Subset=y[subsetObservations]
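For reproducibility, one may want to fix the random seed before drawing the subset. The short sketch
below is not part of the original tutorial; it uses an arbitrary, assumed seed and shows one tuning fit on
the subset, which simply uses x.Subset and y.Subset in place of x and y in the calls that follow.
# sketch: reproducible subsampling, then one tuning fit on the subset
set.seed(123)  # assumed seed, chosen arbitrarily
subsetObservations=sample(1:dim(x)[[1]], nObsSubset, replace=FALSE)
x.Subset=x[subsetObservations,]
y.Subset=y[subsetObservations]
RGLM.subset = randomGLM(x.Subset, y.Subset, classify=FALSE, nBags=50)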
2. Choose accuracy measure
When it comes to measuring the prediction accuracy for a continuous variable (e.g. mouse length),
many measures are available. The following measures are often used; please choose one that suits
your application. In this tutorial, we choose the prediction correlation.
# different kinds of accuracy measures for a continuous outcome
accuracyCor=function(y.predicted,y) cor(y.predicted, y, use="p")
accuracyMSE=function(y.predicted,y) mean((y.predicted-y)^2,na.rm=TRUE)
accuracyMedAbsDev=function(y.predicted,y) {median( abs(y.predicted-y),na.rm=TRUE)}
# Note that an accurate predictor will obtain a high value for accuracyCor
# (the correlation between predicted and observed outcome) but a low value
# for accuracyMSE (the mean squared error) and a low value for
# accuracyMedAbsDev (the median absolute deviation).
# In this tutorial, we choose accuracyCor, i.e. we want to choose parameters
# so that it obtains a high value.
accuracyM = accuracyCor
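To see how the three measures behave, here is a small toy check; the numbers below are made up
purely for illustration and are not part of the original tutorial.
# toy illustration of the three accuracy measures
y.obs = c(8.1, 9.0, 9.5, 10.2)
y.pred = c(8.3, 8.8, 9.9, 10.0)
accuracyCor(y.pred, y.obs)        # near 1 for an accurate predictor
accuracyMSE(y.pred, y.obs)        # near 0 for an accurate predictor
accuracyMedAbsDev(y.pred, y.obs)  # near 0 for an accurate predictor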
3. Should one include pairwise interactions between the features?
RGLM allows interaction terms to be added among features during GLM construction through the
parameter “maxInteractionOrder”, which can have a big effect on prediction performance. Generally,
we do not recommend 3rd or higher order interactions, because of the computational burden and the
small performance improvement. For high dimensional data such as the mouse data set, we recommend
using no interactions at all, because we already have enough features to produce instability in the
GLMs. But here, for illustration purposes, we compare “no interaction” to “2nd order interaction”. Note
that in the following we use only 50 bags instead of the default 100 to speed up parameter tuning.
# RGLM
RGLM = randomGLM(x, y,
classify=FALSE,
nBags=50,
keepModels=TRUE)
# accuracy
accuracyM(RGLM$predictedOOB, y)
# [1] 0.5870843
# RGLM with pairwise interaction between features
# Parallel execution is highly recommended; it is enabled via the parameter
# nThreads.
RGLM.inter2 = randomGLM(x, y,
classify=FALSE,
maxInteractionOrder=2,
nBags=50,
keepModels=TRUE)
# accuracy
accuracyM(RGLM.inter2$predictedOOB, y)
# [1] 0.5792032
As expected, adding pairwise interaction terms to these gene expression data is not worth the trouble.
As a matter of fact, interaction terms lead to a slightly decreased OOB accuracy. In general, we advise
against pairwise interaction terms when dealing with gene expression data.
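As an aside, the comment above recommends parallel execution via the parameter nThreads. The
following hedged sketch repeats the pairwise-interaction fit on multiple threads; the thread count of 4
is an assumption and should be adjusted to the number of available cores.
# sketch: the pairwise-interaction fit with parallel execution
RGLM.inter2.par = randomGLM(x, y,
classify=FALSE,
maxInteractionOrder=2,
nBags=50,
keepModels=TRUE,
nThreads=4)  # assumed thread count; adjust to your machine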
4. Tuning the feature selection parameters (focusing on one parameter at a time)
There are two major feature selection parameters that affect the performance of RGLM:
nFeaturesInBag and nCandidateCovariates. nFeaturesInBag controls the number of features randomly
selected in each bag (random subspace). nCandidateCovariates controls the number of covariates used
in GLM model selection. Now we show how to sequentially tune these 2 parameters.
# choose nFeaturesInBag
# consider the following proportions of the total features
proportionOfFeatures= c(0.1, 0.2, 0.4, 0.6, 0.8, 1)
nFeatureInBagVector=ceiling(proportionOfFeatures *N)
# define vector that saves prediction accuracies
acc = rep(NA, length(nFeatureInBagVector))
# loop over nFeaturesInBag values, calculate individual accuracy
for (i in 1:length(nFeatureInBagVector))
{
cat("step", i, "out of ", length(nFeatureInBagVector), "entries from
nFeatureInBagVector\n")
RGLMtmp = randomGLM(x, y,
classify=FALSE,
nFeaturesInBag = nFeatureInBagVector[i],
nBags=50,
keepModels=TRUE)
predicted = RGLMtmp$predictedOOB
acc[i] = accuracyM(predicted, y)
rm(RGLMtmp, predicted)
}
data.frame(proportionOfFeatures, nFeatureInBagVector,acc)
# Accuracy is highest when nFeatureInBag takes 60% of all features.
# view by plot
pdf("~/Desktop/gene_screening/package/nFeaturesInBagQuantitative.pdf",5,5)
plot(nFeatureInBagVector, acc, ylab="OOB accuracy (correlation)",
xlab="nFeatureInBag", main="Choosing nFeatureInBag", type="l")
text(nFeatureInBagVector,acc,lab= nFeatureInBagVector)
dev.off()
Here nFeaturesInBag equal to 60% of all features (resulting in 3000 features) results in the highest
OOB prediction accuracy. In the following, we assume that nFeaturesInBag has been fixed to 3000.
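Instead of reading the best value off the plot, it can also be extracted programmatically from the
accuracy vector computed above (a one-line sketch):
# nFeaturesInBag value with the highest OOB correlation
nFeatureInBagVector[which.max(acc)]
# [1] 3000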
# choose nCandidateCovariates
# consider 7 values
nCandidateCovariatesVector=c(5,10,20,30,50,75,100)
# define vector that saves prediction accuracies
acc1 = rep(NA, length(nCandidateCovariatesVector))
# loop over nCandidateCovariates values, calculate individual accuracy
for (j in 1:length(nCandidateCovariatesVector))
{
cat("step", j, "out of ", length(nCandidateCovariatesVector), "entries from
nCandidateCovariatesVector\n")
RGLMtmp = randomGLM(x, y,
classify=FALSE,
nFeaturesInBag = ceiling(0.6*N),
nCandidateCovariates = nCandidateCovariatesVector[j],
nBags=50,
keepModels=TRUE)
predicted = RGLMtmp$predictedOOB
acc1[j] = accuracyM(predicted, y)
rm(RGLMtmp, predicted)
}
data.frame(nCandidateCovariatesVector,acc1)
#   nCandidateCovariatesVector      acc1
# 1                          5 0.5624270
# 2                         10 0.5670274
# 3                         20 0.5822062
# 4                         30 0.6000472
# 5                         50 0.6061715
# 6                         75 0.6036171
# 7                        100 0.5783763
# nCandidateCovariates in the range 30-75 gives the highest accuracy.
# Therefore, we choose nCandidateCovariates=50, which equals the default
# value of this parameter.
# view by plot
pdf("~/Desktop/gene_screening/package/nCandidateCovariatesQuantitative.pdf"
,5,5)
plot(nCandidateCovariatesVector,acc1,ylab="OOB
accuracy
(correlation)",xlab="nCandidateCovariates",main="Choosing
nCandidateCovariates",type="l")
text(nCandidateCovariatesVector,acc1,lab= nCandidateCovariatesVector)
dev.off()
nCandidateCovariates=50 corresponds to the highest accuracy. Note that this is also the default value
for nCandidateCovariates.
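Again, the best value can be extracted programmatically from the accuracy vector (a one-line sketch):
# nCandidateCovariates value with the highest OOB correlation
nCandidateCovariatesVector[which.max(acc1)]
# [1] 50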
5. Optimizing the parameter values by varying both at the same time
Previously, we tuned (chose) nFeaturesInBag and nCandidateCovariates one at a time. But clearly it is
preferable to consider both parameter choices at the same time, i.e. to see how the accuracy changes
over a grid of possible values.
# choose nFeaturesInBag and nCandidateCovariates at the same time
nFeatureInBagVector=ceiling(c(0.1, 0.2, 0.4, 0.6, 0.8, 1)*N)
nCandidateCovariatesVector=c(5,10,20,30,50,75,100)
# define vector that saves prediction accuracies
acc2 = matrix(NA, length(nFeatureInBagVector), length(nCandidateCovariatesVector))
rownames(acc2) = paste("feature", nFeatureInBagVector, sep="")
colnames(acc2) = paste("cov", nCandidateCovariatesVector, sep="")
# loop over nFeaturesInBag and nCandidateCovariates values, calculate individual accuracy
for (i in 1:length(nFeatureInBagVector))
{
cat("step",
i,
"out
of
",
length(nFeatureInBagVector),
"entries
from
nFeatureInBagVector\n")
for (j in 1:length(nCandidateCovariatesVector))
{
cat("step", j, "out of ", length(nCandidateCovariatesVector), "entries from
nCandidateCovariatesVector\n")
RGLMtmp = randomGLM(x, y,
classify=FALSE,
nFeaturesInBag = nFeatureInBagVector[i],
nCandidateCovariates = nCandidateCovariatesVector[j],
nBags=50,
keepModels=TRUE)
predicted = RGLMtmp$predictedOOB
acc2[i, j] = accuracyM(predicted, y)
rm(RGLMtmp, predicted)
}
}
round(acc2,3)
#              cov5 cov10 cov20 cov30 cov50 cov75 cov100
# feature500  0.549 0.555 0.569 0.582 0.573 0.596  0.575
# feature1000 0.553 0.567 0.582 0.583 0.587 0.597  0.566
# feature2000 0.558 0.563 0.586 0.590 0.590 0.571  0.559
# feature3000 0.562 0.567 0.582 0.600 0.606 0.604  0.578
# feature4000 0.565 0.570 0.575 0.582 0.603 0.581  0.557
# feature5000 0.557 0.566 0.575 0.584 0.582 0.568  0.553
# view by plot
# load required library
library(WGCNA)
pdf("~/Desktop/gene_screening/package/parameterChoiceQuantitative.pdf")
par(mar=c(2, 5, 4, 2))
labeledHeatmap(
Matrix = acc2,
yLabels = rownames(acc2),
xLabels = colnames(acc2),
colors = greenWhiteRed(100)[51:100],
textMatrix = round(acc2,3),
setStdMargins = FALSE,
xLabelsAngle=0,
xLabelsAdj = 0.5,
main = "Parameter choice")
dev.off()
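The best cell of the accuracy matrix can likewise be located programmatically (a short sketch using
only the acc2 matrix computed above):
# locate the maximum of the OOB accuracy matrix
best = which(acc2 == max(acc2), arr.ind=TRUE)
rownames(acc2)[best[1, "row"]]  # "feature3000"
colnames(acc2)[best[1, "col"]]  # "cov50"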
Message: choosing nFeaturesInBag=3000 and nCandidateCovariates=50 leads to the highest OOB
prediction accuracy.
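With the tuned values in hand, a natural final step (not shown in the original tutorial) would be to refit
RGLM with the tuned parameters and the default number of bags; a minimal sketch:
# final fit with the tuned parameters and the default nBags=100
RGLM.final = randomGLM(x, y,
classify=FALSE,
nFeaturesInBag=3000,
nCandidateCovariates=50,
nBags=100,
keepModels=TRUE)
accuracyM(RGLM.final$predictedOOB, y)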
References
1. Song L, Langfelder P, Horvath S (2013) Random generalized linear model: a highly accurate
and interpretable ensemble predictor. BMC Bioinformatics 14:5. PMID: 23323760, DOI:
10.1186/1471-2105-14-5.
2. Ghazalpour A, Doss S, Zhang B, Plaisier C, Wang S, Schadt E, Thomas A, Drake T, Lusis A,
Horvath S (2006) Integrating Genetics and Network Analysis to Characterize Genes Related to
Mouse Weight. PLoS Genetics 2(2):8. PMID: 16934000.