Tutorial for the randomGLM R package: Parameter tuning based on OOB prediction

Lin Song, Steve Horvath

In this tutorial, we show how to fine-tune RGLM parameters based on out-of-bag (OOB) prediction performance. Similar to a cross-validation estimate, the OOB estimate of the prediction accuracy is nearly unbiased, making it a good criterion for judging how parameter choices affect the prediction accuracy.

Data: We use the small, round blue cell tumors (srbct) data set [1,2] as an example training data set. The data can be found on our webpage at http://labs.genetics.ucla.edu/horvath/RGLM. It contains the gene expression profiles of 2308 genes across 63 observations.

1. Data preparation

# load required package
library(randomGLM)

# download the data from the webpage and load it.
# Importantly, change the path to the data and use forward slashes (/)
setwd("C:/Users/Horvath/Documents/CreateWebpage/RGLM/Tutorials/")
load("srbct.rda")

# check data
dim(srbct$x)
table(srbct$y)

Output:
#  1  2
# 40 23

Message: The binary outcome is poorly balanced since 40/(40+23)=64% of observations are in class 1.

x = srbct$x
y = srbct$y

# number of features
N = ncol(x)

If there is a large number of observations in the training set, for example more than 2000, the RGLM computation becomes very time consuming. In that case we suggest carrying out the parameter tuning on a random subset of observations. Example code is as follows:

nObsSubset = 500
subsetObservations = sample(1:dim(x)[[1]], nObsSubset, replace=FALSE)
x.Subset = x[subsetObservations, ]
y.Subset = y[subsetObservations]

2. Define accuracy measures

Many measures can be used to evaluate prediction accuracy, such as the proportion of observations classified correctly, sensitivity, specificity, the area under the ROC curve, and the C index [3]. Users can define a measure according to their needs. Here we simply use the proportion classified correctly. An illustration of the C index is provided at the end of this tutorial.

accuracyM = function(tab)
{
  num1 = sum(diag(tab))
  denom1 = sum(tab)
  signif(num1/denom1, 3)
}

3. Should one include pairwise interactions between the features?

RGLM allows interaction terms among features to be added in the GLM construction through the parameter "maxInteractionOrder", which can have a big effect on prediction performance. Generally, we do not recommend 3rd or higher order interactions, because of the computational burden and the small performance improvement. For high dimensional data such as the srbct data set, we recommend using no interactions at all, because there are already enough features to produce instability in the GLMs. Here, for illustration purposes, we compare "no interaction" to "2nd order interaction". Note that in the following we use only 50 bags instead of the default 100, to speed up parameter tuning.

# Fit the RGLM with 50 bags (but we recommend that you use 100 bags)
RGLM = randomGLM(x, y, classify=TRUE, nBags=50, keepModels=TRUE)

# accuracy
accuracyM(table(RGLM$predictedOOB, y))
# [1] 1

# RGLM with pairwise interactions between features
The following calculation could take up to 50 minutes using a single core (i.e. nThreads=1). Parallel execution is highly recommended; it is controlled by the parameter nThreads. As you can learn from the help file of randomGLM, if nThreads is left at its default value (NULL), it is determined automatically as the number of available cores if that number is 3 or less, and as the number of cores minus 1 if 4 or more cores are available.
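To see what the automatic rule will do on your machine, you can query the number of available cores yourself. The following is a minimal sketch using the parallel package that ships with base R; the explicit nThreads choice shown in the comments is only an illustration, not something the package requires.

# check how many cores this machine reports
library(parallel)
nCores = detectCores()
nCores
# mirroring the automatic rule: use all cores if 3 or fewer, otherwise leave one free
nThreadsChoice = if (nCores <= 3) nCores else nCores - 1
# this value could then be passed explicitly, e.g. randomGLM(..., nThreads=nThreadsChoice)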
RGLM.inter2 = randomGLM(x, y, classify=TRUE, maxInteractionOrder=2,
  nBags=50, keepModels=TRUE, nThreads=NULL)

# accuracy
accuracyM(table(RGLM.inter2$predictedOOB, y))
# [1] 0.9841

As expected, adding pairwise interaction terms decreases the OOB accuracy from 1 to 0.984. Therefore, it is not worth the trouble to consider interactions here. In general, we do not advise using pairwise interaction terms when dealing with gene expression data or other high dimensional genomic data.

4. Tuning the feature selection parameters (one parameter at a time)

There are two major feature selection parameters that affect the performance of RGLM: nFeaturesInBag and nCandidateCovariates. nFeaturesInBag controls the number of features randomly selected into each bag (random subspace). nCandidateCovariates controls the number of covariates considered during GLM model selection. We now show how to tune these two parameters sequentially.

# number of features
N = ncol(x)

# choose nFeaturesInBag among the following possible values
nFeatureInBagVector = ceiling(c(0.1, 0.2, 0.4, 0.6, 0.8, 1)*N)

# define a vector that saves the prediction accuracies
acc = rep(NA, length(nFeatureInBagVector))

# loop over nFeaturesInBag values and calculate the accuracy for each
for (i in 1:length(nFeatureInBagVector))
{
  cat("step", i, "out of", length(nFeatureInBagVector), "entries from nFeatureInBagVector\n")
  RGLMtmp = randomGLM(x, y, classify=TRUE,
    nFeaturesInBag = nFeatureInBagVector[i], nBags=50, keepModels=TRUE)
  predicted = RGLMtmp$predictedOOB
  acc[i] = accuracyM(table(predicted, y))
  rm(RGLMtmp, predicted)
}

data.frame(nFeatureInBagVector, acc)
#   nFeatureInBagVector    acc
# 1                 231 0.9683
# 2                 462 1.0000
# 3                 924 0.9841
# 4                1385 0.9683
# 5                1847 0.9841
# 6                2308 0.9841

# Accuracy is highest when nFeaturesInBag equals 20% of all features.
# View by plot; choose nFeaturesInBag = 462.
pdf("~/Desktop/gene_screening/package/nFeaturesInBag.pdf", 5, 5)
plot(nFeatureInBagVector, acc, ylab="OOB accuracy", xlab="nFeatureInBag",
  main="Choosing nFeatureInBag", type="l")
text(nFeatureInBagVector, acc, lab=nFeatureInBagVector)
dev.off()

Message: Any of the considered choices leads to a nearly perfect OOB estimate of the accuracy. Here, setting nFeaturesInBag to 20% of all features (462 features) results in the highest OOB prediction accuracy. Note that this optimized value is the same as the default choice of nFeaturesInBag.

How to choose nCandidateCovariates? In the following, we assume that nFeaturesInBag has been set to the value chosen above, i.e. nFeaturesInBag=ceiling(0.2*N).

# We consider the following choices for nCandidateCovariates
nCandidateCovariatesVector = c(5, 10, 20, 30, 50, 75, 100)

# define a vector that stores the resulting OOB estimates of the accuracy
acc1 = rep(NA, length(nCandidateCovariatesVector))

# loop over nCandidateCovariates values and calculate the accuracy for each
for (j in 1:length(nCandidateCovariatesVector))
{
  cat("step", j, "out of", length(nCandidateCovariatesVector), "entries from nCandidateCovariatesVector\n")
  RGLMtmp = randomGLM(x, y, classify=TRUE, nFeaturesInBag = ceiling(0.2*N),
    nCandidateCovariates = nCandidateCovariatesVector[j], nBags=50, keepModels=TRUE)
  predicted = RGLMtmp$predictedOOB
  acc1[j] = accuracyM(table(predicted, y))
  rm(RGLMtmp, predicted)
}

data.frame(nCandidateCovariatesVector, acc1)
#   nCandidateCovariatesVector acc1
# 1                          5    1
# 2                         10    1
# 3                         20    1
# 4                         30    1
# 5                         50    1
# 6                         75    1
# 7                        100    1

# All accuracies are equal to 1, suggesting that nCandidateCovariates does not have a
# significant effect on the prediction accuracy in this data set.

# Here is a plot that shows how the accuracy (y-axis) depends on nCandidateCovariates (x-axis).
pdf("~/Desktop/gene_screening/package/nCandidateCovariates.pdf", 5, 5)
plot(nCandidateCovariatesVector, acc1, ylab="OOB accuracy", xlab="nCandidateCovariates",
  main="Choosing nCandidateCovariates", type="l")
text(nCandidateCovariatesVector, acc1, lab=nCandidateCovariatesVector)
dev.off()

All accuracies are equal to 1, suggesting that nCandidateCovariates does not have a significant effect on prediction in this data set. In this case, nCandidateCovariates=5 is preferred for computational reasons. Recall that the computation time depends strongly on nCandidateCovariates.
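If you want to quantify the speed difference on your own machine, a simple check is to time two fits with system.time(). This is only a rough sketch; the numbers depend heavily on your hardware and on nThreads, so treat them as relative rather than absolute.

# rough timing comparison for a small and a large value of nCandidateCovariates
timeSmall = system.time(
  randomGLM(x, y, classify=TRUE, nFeaturesInBag=ceiling(0.2*N),
            nCandidateCovariates=5, nBags=50))
timeLarge = system.time(
  randomGLM(x, y, classify=TRUE, nFeaturesInBag=ceiling(0.2*N),
            nCandidateCovariates=100, nBags=50))
rbind(nCandidateCovariates5 = timeSmall, nCandidateCovariates100 = timeLarge)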
5. Optimizing both parameter values at the same time

Previously, we tuned (chose) nFeaturesInBag and nCandidateCovariates one at a time. But clearly it is preferable to consider both parameter choices at the same time, i.e. to see how the accuracy changes over a grid of possible values.

# choose nFeaturesInBag and nCandidateCovariates at the same time
nFeatureInBagVector = ceiling(c(0.1, 0.2, 0.4, 0.6, 0.8, 1)*N)
nCandidateCovariatesVector = c(5, 10, 20, 30, 50, 75, 100)

# Note that nFeatureInBagVector defines the rows of the grid and
# nCandidateCovariatesVector defines the columns. For each grid value,
# we calculate the OOB estimate of the accuracy and store it in the following matrix.
acc2 = matrix(NA, length(nFeatureInBagVector), length(nCandidateCovariatesVector))
rownames(acc2) = paste("feature", nFeatureInBagVector, sep="")
colnames(acc2) = paste("cov", nCandidateCovariatesVector, sep="")

# loop over nFeaturesInBag and nCandidateCovariates values, and calculate the
# resulting OOB estimate of the accuracy
for (i in 1:length(nFeatureInBagVector))
{
  cat("step", i, "out of", length(nFeatureInBagVector), "entries from nFeatureInBagVector\n")
  for (j in 1:length(nCandidateCovariatesVector))
  {
    cat("step", j, "out of", length(nCandidateCovariatesVector), "entries from nCandidateCovariatesVector\n")
    RGLMtmp = randomGLM(x, y, classify=TRUE,
      nFeaturesInBag = nFeatureInBagVector[i],
      nCandidateCovariates = nCandidateCovariatesVector[j],
      nBags=50, keepModels=TRUE)
    predicted = RGLMtmp$predictedOOB
    acc2[i, j] = accuracyM(table(predicted, y))
    rm(RGLMtmp, predicted)
  }
}

acc2
#               cov5  cov10  cov20  cov30  cov50  cov75 cov100
# feature231  0.9841 0.9365 0.9365 0.9683 0.9683 0.9683 0.9683
# feature462  1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
# feature924  0.9841 0.9841 0.9841 0.9841 0.9841 0.9841 0.9841
# feature1385 0.9683 0.9683 0.9683 0.9683 0.9683 0.9683 0.9683
# feature1847 0.9841 0.9841 0.9841 0.9841 0.9841 0.9841 0.9841
# feature2308 0.9841 0.9841 0.9841 0.9841 0.9841 0.9841 0.9841

# Create a heatmap plot that visualizes the values in the above matrix
# load required library
library(WGCNA)
# the following is not important for generating a small heatmap
allowWGCNAThreads()
# comment: if you get an error message here, please ignore it
# disableWGCNAThreads()
pdf("~/Desktop/gene_screening/package/parameterChoice.pdf")
par(mar=c(2, 5, 4, 2))
labeledHeatmap(Matrix = acc2, yLabels = rownames(acc2), xLabels = colnames(acc2),
  colors = greenWhiteRed(100)[51:100], textMatrix = acc2, setStdMargins = FALSE,
  xLabelsAngle = 0, xLabelsAdj = 0.5, main = "Parameter choice")
dev.off()

Message: There is a very strong signal in this data set: any combination of nFeatureInBagVector (rows) and nCandidateCovariatesVector (columns) leads to a nearly perfect (OOB estimate of the) accuracy. For example, choosing nFeatureInBagVector=462 leads to an accuracy of 1 for any value of nCandidateCovariatesVector between 5 and 100.
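If the grid showed clearer differences between the cells, one could read off the best parameter combination programmatically instead of inspecting the heatmap by eye. The following is a minimal sketch using only base R functions on the acc2 matrix defined above.

# identify the grid cell(s) with the highest OOB accuracy
best = which(acc2 == max(acc2), arr.ind = TRUE)
best
# translate the row/column indices back into parameter values
data.frame(nFeaturesInBag = nFeatureInBagVector[best[, "row"]],
           nCandidateCovariates = nCandidateCovariatesVector[best[, "col"]],
           accuracy = acc2[best])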
6. Effect of the number of bags on the OOB prediction accuracy

In RGLM, the default number of bags is 100, but in the preceding parameter tuning we used only 50 bags to save time. In the following we show how the number of bags affects the OOB prediction accuracy.

# choose nBags
# consider 5 values
nBagsVector = c(10, 20, 50, 100, 200)

# define a vector that saves the prediction accuracies
acc3 = rep(NA, length(nBagsVector))

# loop over nBags values and calculate the accuracy for each
for (k in 1:length(nBagsVector))
{
  print(k)
  RGLMtmp = randomGLM(x, y, classify=TRUE, nBags=nBagsVector[k], keepModels=TRUE)
  predicted = RGLMtmp$predictedOOB
  acc3[k] = accuracyM(table(predicted, y))
  rm(RGLMtmp, predicted)
}

data.frame(nBagsVector, acc3)
#   nBagsVector   acc3
# 1          10 0.9672
# 2          20 0.9683
# 3          50 1.0000
# 4         100 0.9841
# 5         200 1.0000

pdf("~/Desktop/gene_screening/package/nBags.pdf", 5, 5)
plot(nBagsVector, acc3, ylab="OOB accuracy", xlab="nBags", main="Choosing nBags", type="l")
text(nBagsVector, acc3, lab=nBagsVector)
dev.off()

We can see that the OOB accuracy increases over the first few dozen bags; 50 bags are enough for parameter tuning.

7. RGLM parameter tuning based on the C index

As mentioned previously, users can define their own accuracy measures for tuning RGLM parameters. Here we briefly illustrate how to choose parameters based on Harrell's C index [3].

# define accuracy measure
library(Hmisc)
Cindex = function(classProbabilities, y)
{
  rcorr.cens(classProbabilities[, 2], y, outx=TRUE)[1]
}
accuracyMClassProb = Cindex

# choose nFeaturesInBag and nCandidateCovariates at the same time
nFeatureInBagVector = ceiling(c(0.1, 0.2, 0.4, 0.6, 0.8, 1)*N)
nCandidateCovariatesVector = c(5, 10, 20, 30, 50, 75, 100)

# define a matrix that saves the prediction accuracies
acc4 = matrix(NA, length(nFeatureInBagVector), length(nCandidateCovariatesVector))
rownames(acc4) = paste("feature", nFeatureInBagVector, sep="")
colnames(acc4) = paste("cov", nCandidateCovariatesVector, sep="")

# loop over nFeaturesInBag and nCandidateCovariates values and calculate the accuracy
# for each combination
for (i in 1:length(nFeatureInBagVector))
{
  cat("step", i, "out of", length(nFeatureInBagVector), "entries from nFeatureInBagVector\n")
  for (j in 1:length(nCandidateCovariatesVector))
  {
    cat("step", j, "out of", length(nCandidateCovariatesVector), "entries from nCandidateCovariatesVector\n")
    RGLMtmp = randomGLM(x, y, classify=TRUE,
      nFeaturesInBag = nFeatureInBagVector[i],
      nCandidateCovariates = nCandidateCovariatesVector[j],
      nBags=50, keepModels=TRUE)
    predictedProb = RGLMtmp$predictedOOB.response
    acc4[i, j] = accuracyMClassProb(predictedProb, y)
    rm(RGLMtmp, predictedProb)
  }
}

acc4
#                  cov5    cov10    cov20    cov30    cov50    cov75   cov100
# feature231  1.0000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
# feature462  1.0000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
# feature924  1.0000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
# feature1385 0.9989130 0.998913 0.998913 0.998913 1.000000 1.000000 1.000000
# feature1847 1.0000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
# feature2308 0.9978261 0.998913 0.998913 0.998913 0.998913 0.998913 0.998913
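For a binary outcome, Harrell's C index corresponds to the familiar concordance (AUC-type) statistic: the fraction of class-1/class-2 pairs in which the class-2 observation receives the higher predicted class-2 probability. The following sanity-check sketch is not part of the package; it assumes a hypothetical vector classProb2 of OOB class-2 probabilities (e.g. the second column of predictedOOB.response from a single fit) and, like outx=TRUE above, ignores pairs tied on the predictor.

# classProb2: hypothetical vector of OOB predicted probabilities of class 2 (one per observation)
# y: the observed class labels (1 or 2), as above
concordant = sum(outer(classProb2[y == 2], classProb2[y == 1], ">"))
discordant = sum(outer(classProb2[y == 2], classProb2[y == 1], "<"))
# pairs tied on the predicted probability are excluded, mirroring outx=TRUE
concordant / (concordant + discordant)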
# view by plot
# load required library
library(WGCNA)
pdf("~/Desktop/gene_screening/package/parameterChoiceCindex.pdf")
par(mar=c(2, 5, 4, 2))
labeledHeatmap(Matrix = acc4, yLabels = rownames(acc4), xLabels = colnames(acc4),
  colors = greenWhiteRed(100)[51:100], textMatrix = signif(acc4, 3),
  setStdMargins = FALSE, xLabelsAngle = 0, xLabelsAdj = 0.5, main = "Parameter choice")
dev.off()

The message is the same as in the analysis based on the proportion classified correctly: there is a very strong signal in this data set, and any combination of nFeatureInBagVector (rows) and nCandidateCovariatesVector (columns) leads to a nearly perfect (OOB estimate of the) accuracy.

References

1. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7(6):673-679. http://dx.doi.org/10.1038/89044

2. Song L, Langfelder P, Horvath S (2013) Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinformatics 14:5. PMID: 23323760, DOI: 10.1186/1471-2105-14-5

3. Harrell FE Jr, Califf RM, Pryor DB, Lee KL, Rosati RA (1982) Evaluating the yield of medical tests. JAMA 247(18):2543-2546.