Tutorial for the randomGLM R package: Parameter tuning based on OOB prediction for quantitative outcomes

Lin Song, Steve Horvath

In this tutorial, we show how to fine-tune RGLM parameters based on out-of-bag (OOB) prediction performance for quantitative outcome prediction. Similar to a cross-validation estimate, the OOB estimate of the prediction accuracy is nearly unbiased, which makes it a good criterion for judging how parameter choices affect the prediction accuracy.

Data: We use mouse adipose tissue gene expression data from the lab of Jake Lusis (Ghazalpour et al 2006, reference [2]). The expression data involve 5000 genes (features) measured across n=239 mice (observations). Here we aim to predict y=mouse length (cm) based on the 5000 genes. The data can be found on our webpage at http://labs.genetics.ucla.edu/horvath/RGLM. Details of this data set are explained in [1].

1. Data preparation

# load required package
library(randomGLM)

# download the data from the webpage and load it.
# Importantly, change the path to the data and use forward slashes /
setwd("C:/Users/Horvath/Documents/CreateWebpage/RGLM/Tutorials/")
load("mouse.rda")

# check data
dim(mouse$x)
summary(mouse$y)

x = mouse$x
y = mouse$y
N = ncol(x)

If there is a large number of observations in the training set (for example, more than 2000), the computation of RGLM becomes very time consuming. In that case, we suggest doing parameter tuning on a random subset of observations. Example code is as follows:

nObsSubset = 500
subsetObservations = sample(1:dim(x)[[1]], nObsSubset, replace=FALSE)
x.Subset = x[subsetObservations, ]
y.Subset = y[subsetObservations]
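A minimal sketch of how such a subset would then be used: simply pass x.Subset and y.Subset to randomGLM in place of x and y in the tuning calls below. (The mouse data comprise only 239 observations, so subsetting is not actually needed here.) The nThreads parameter, mentioned again in Section 3, runs the computation in parallel; the value 4 below is an arbitrary assumption about the number of available cores.

# illustration only: fit RGLM on the observation subset, using 4 threads
RGLMsubset = randomGLM(x.Subset, y.Subset, classify=FALSE, nBags=50, keepModels=TRUE, nThreads=4)
# OOB prediction accuracy (correlation) on the subset
cor(RGLMsubset$predictedOOB, y.Subset, use="p")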
2. Choose accuracy measure

When it comes to measuring the prediction accuracy of a continuous variable (e.g. mouse length), many measures are available. The following measures are often used. Please choose one of them for your application. In this tutorial, we choose the prediction correlation.

# different kinds of accuracy measures for a continuous outcome
accuracyCor = function(y.predicted, y) cor(y.predicted, y, use="p")
accuracyMSE = function(y.predicted, y) mean((y.predicted-y)^2, na.rm=TRUE)
accuracyMedAbsDev = function(y.predicted, y) {median(abs(y.predicted-y), na.rm=TRUE)}

# Note that an accurate predictor will obtain a high value for accuracyCor
# (correlation between predicted and observed outcome) but a low value for
# accuracyMSE (the mean squared error) and a low value for accuracyMedAbsDev
# (the median absolute deviation).
# In this tutorial, we choose accuracyCor, i.e. we want to choose parameters
# such that it obtains a high value.
accuracyM = accuracyCor

3. Should one include pairwise interactions between the features?

RGLM allows interaction terms among features to be added during GLM construction through the parameter "maxInteractionOrder", which has a big effect on prediction performance. Generally, we do not recommend 3rd or higher order interactions, because of the computational burden and the small performance improvement. For high dimensional data such as the mouse data set, we recommend using no interactions at all, because there are already enough features to produce instability in the GLMs. But here, for illustration purposes, we compare "no interaction" to "2nd order interaction". Note that we use only 50 bags instead of the default 100 in the following, to speed up parameter tuning.

# RGLM
RGLM = randomGLM(x, y, classify=FALSE, nBags=50, keepModels=TRUE)
# accuracy
accuracyM(RGLM$predictedOOB, y)
# [1] 0.5870843

# RGLM with pairwise interactions between features
# Parallel execution is highly recommended; it is implemented by the parameter nThreads.
RGLM.inter2 = randomGLM(x, y, classify=FALSE, maxInteractionOrder=2, nBags=50, keepModels=TRUE)
# accuracy
accuracyM(RGLM.inter2$predictedOOB, y)
# [1] 0.5792032

As expected, adding pairwise interaction terms to these gene expression data is not worth the trouble. As a matter of fact, the interaction terms lead to a slightly decreased OOB accuracy. In general, we advise against pairwise interaction terms when dealing with gene expression data.

4. Tuning the feature selection parameters (focusing on one parameter at a time)

Two major feature selection parameters affect the performance of RGLM: nFeaturesInBag and nCandidateCovariates. nFeaturesInBag controls the number of features randomly selected into each bag (the random subspace). nCandidateCovariates controls the number of covariates used in GLM model selection. Now we show how to tune these 2 parameters sequentially.

# choose nFeaturesInBag
# consider the following proportions of the total number of features
proportionOfFeatures = c(0.1, 0.2, 0.4, 0.6, 0.8, 1)
nFeatureInBagVector = ceiling(proportionOfFeatures*N)

# define a vector that saves the prediction accuracies
acc = rep(NA, length(nFeatureInBagVector))

# loop over nFeaturesInBag values and calculate the individual accuracies
for (i in 1:length(nFeatureInBagVector))
{
  cat("step", i, "out of", length(nFeatureInBagVector), "entries from nFeatureInBagVector\n")
  RGLMtmp = randomGLM(x, y, classify=FALSE, nFeaturesInBag = nFeatureInBagVector[i], nBags=50, keepModels=TRUE)
  predicted = RGLMtmp$predictedOOB
  acc[i] = accuracyM(predicted, y)
  rm(RGLMtmp, predicted)
}
data.frame(proportionOfFeatures, nFeatureInBagVector, acc)
# Accuracy is highest when nFeaturesInBag equals 60% of all features.

# view by plot
pdf("~/Desktop/gene_screening/package/nFeaturesInBagQuantitative.pdf", 5, 5)
plot(nFeatureInBagVector, acc, ylab="OOB accuracy (correlation)", xlab="nFeaturesInBag", main="Choosing nFeaturesInBag", type="l")
text(nFeatureInBagVector, acc, lab=nFeatureInBagVector)
dev.off()

Here, setting nFeaturesInBag equal to 60% of all features (resulting in 3000 features) leads to the highest OOB prediction accuracy. In the following, we assume that nFeaturesInBag has been fixed to 3000.

# choose nCandidateCovariates
# consider 7 values
nCandidateCovariatesVector = c(5, 10, 20, 30, 50, 75, 100)

# define a vector that saves the prediction accuracies
acc1 = rep(NA, length(nCandidateCovariatesVector))

# loop over nCandidateCovariates values and calculate the individual accuracies
for (j in 1:length(nCandidateCovariatesVector))
{
  cat("step", j, "out of", length(nCandidateCovariatesVector), "entries from nCandidateCovariatesVector\n")
  RGLMtmp = randomGLM(x, y, classify=FALSE, nFeaturesInBag = ceiling(0.6*N), nCandidateCovariates = nCandidateCovariatesVector[j], nBags=50, keepModels=TRUE)
  predicted = RGLMtmp$predictedOOB
  acc1[j] = accuracyM(predicted, y)
  rm(RGLMtmp, predicted)
}
data.frame(nCandidateCovariatesVector, acc1)
#  nCandidateCovariatesVector      acc1
#1                          5 0.5624270
#2                         10 0.5670274
#3                         20 0.5822062
#4                         30 0.6000472
#5                         50 0.6061715
#6                         75 0.6036171
#7                        100 0.5783763
# nCandidateCovariates in the range 30-75 gives the highest accuracy. Therefore, we
# choose nCandidateCovariates=50, which equals the default value of this parameter.

# view by plot
pdf("~/Desktop/gene_screening/package/nCandidateCovariatesQuantitative.pdf", 5, 5)
plot(nCandidateCovariatesVector, acc1, ylab="OOB accuracy (correlation)", xlab="nCandidateCovariates", main="Choosing nCandidateCovariates", type="l")
text(nCandidateCovariatesVector, acc1, lab=nCandidateCovariatesVector)
dev.off()

nCandidateCovariates=50 corresponds to the highest accuracy. Note that this is also the default value of nCandidateCovariates.

5. Optimizing the parameter values by varying both at the same time

Previously, we tuned (chose) nFeaturesInBag and nCandidateCovariates one at a time. But it is clearly preferable to consider both parameter choices at the same time, i.e. to see how the accuracy changes over a grid of possible value pairs.

# choose nFeaturesInBag and nCandidateCovariates at the same time
nFeatureInBagVector = ceiling(c(0.1, 0.2, 0.4, 0.6, 0.8, 1)*N)
nCandidateCovariatesVector = c(5, 10, 20, 30, 50, 75, 100)

# define a matrix that saves the prediction accuracies
acc2 = matrix(NA, length(nFeatureInBagVector), length(nCandidateCovariatesVector))
rownames(acc2) = paste("feature", nFeatureInBagVector, sep="")
colnames(acc2) = paste("cov", nCandidateCovariatesVector, sep="")

# loop over nFeaturesInBag and nCandidateCovariates values, calculate the individual accuracies
for (i in 1:length(nFeatureInBagVector))
{
  cat("step", i, "out of", length(nFeatureInBagVector), "entries from nFeatureInBagVector\n")
  for (j in 1:length(nCandidateCovariatesVector))
  {
    cat("step", j, "out of", length(nCandidateCovariatesVector), "entries from nCandidateCovariatesVector\n")
    RGLMtmp = randomGLM(x, y, classify=FALSE, nFeaturesInBag = nFeatureInBagVector[i], nCandidateCovariates = nCandidateCovariatesVector[j], nBags=50, keepModels=TRUE)
    predicted = RGLMtmp$predictedOOB
    acc2[i, j] = accuracyM(predicted, y)
    rm(RGLMtmp, predicted)
  }
}
round(acc2, 3)
#              cov5 cov10 cov20 cov30 cov50 cov75 cov100
# feature500  0.549 0.555 0.569 0.582 0.573 0.596  0.575
# feature1000 0.553 0.567 0.582 0.583 0.587 0.597  0.566
# feature2000 0.558 0.563 0.586 0.590 0.590 0.571  0.559
# feature3000 0.562 0.567 0.582 0.600 0.606 0.604  0.578
# feature4000 0.565 0.570 0.575 0.582 0.603 0.581  0.557
# feature5000 0.557 0.566 0.575 0.584 0.582 0.568  0.553

# view by plot
# load required library
library(WGCNA)
pdf("~/Desktop/gene_screening/package/parameterChoiceQuantitative.pdf")
par(mar=c(2, 5, 4, 2))
labeledHeatmap(
  Matrix = acc2,
  yLabels = rownames(acc2),
  xLabels = colnames(acc2),
  colors = greenWhiteRed(100)[51:100],
  textMatrix = round(acc2, 3),
  setStdMargins = FALSE,
  xLabelsAngle = 0,
  xLabelsAdj = 0.5,
  main = "Parameter choice")
dev.off()

Message: choosing nFeaturesInBag=3000 and nCandidateCovariates=50 leads to the highest prediction accuracy.
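The best parameter combination can also be read off acc2 programmatically. The following sketch is an addition to the tutorial (the object name RGLM.final is hypothetical): it extracts the best grid cell and refits the final predictor on all data with the default number of bags (nBags=100).

# pick the grid cell with the highest OOB accuracy
best = which(acc2 == max(acc2), arr.ind=TRUE)
nFeaturesInBag.best = nFeatureInBagVector[best[1, 1]]
nCandidateCovariates.best = nCandidateCovariatesVector[best[1, 2]]
# refit on all data with the chosen parameters and the default 100 bags
RGLM.final = randomGLM(x, y, classify=FALSE, nFeaturesInBag = nFeaturesInBag.best, nCandidateCovariates = nCandidateCovariates.best, keepModels=TRUE)
accuracyM(RGLM.final$predictedOOB, y)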
References

1. Song L, Langfelder P, Horvath S (2013) Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinformatics 14:5. PMID: 23323760. DOI: 10.1186/1471-2105-14-5.
2. Ghazalpour A, Doss S, Zhang B, Plaisier C, Wang S, Schadt E, Thomas A, Drake T, Lusis A, Horvath S (2006) Integrating genetics and network analysis to characterize genes related to mouse weight. PLoS Genetics 2(2):8. PMID: 16934000.