Tutorial for the randomGLM R package: Interpretation of the RGLM predictor

Lin Song, Steve Horvath

In this tutorial, we show how to select important features from RGLM and how to interpret the ensemble predictor. We use the small, round blue cell tumors (srbct) data set [1,2] as an example training data set. It is composed of the gene expression profiles of 2308 genes across 63 observations. The data can be found on our webpage at http://labs.genetics.ucla.edu/horvath/RGLM. No test set is needed.

1. Data preparation

# load required package
library(randomGLM)

# download the data from the webpage and load it.
# Importantly, change the path to the data and use forward slashes /
setwd("C:/Users/Horvath/Documents/CreateWebpage/RGLM/Tutorials/")
load("srbct.rda")

# check data
dim(srbct$x)
table(srbct$y)
#  1  2
# 40 23

x = srbct$x
y = srbct$y

# number of features
N = ncol(x)

# define the function misclassification.rate, used for accuracy calculation
if (exists("misclassification.rate")) rm(misclassification.rate);
misclassification.rate = function(tab) {
  num1 = sum(diag(tab))
  denom1 = sum(tab)
  signif(1 - num1/denom1, 3)
}

2. RGLM prediction

First, we carry out RGLM prediction with default parameter settings. The out-of-bag prediction accuracy is 0.984, i.e. 1 of the 63 observations is misclassified.

RGLM = randomGLM(x, y, classify=TRUE, keepModels=TRUE)
tab1 = table(y, RGLM$predictedOOB)
tab1
#      1  2
#   1 40  0
#   2  1 22

# accuracy
1 - misclassification.rate(tab1)
# [1] 0.9841

3. Feature selection

We define the variable importance measure of a feature as the number of times it is selected by forward regression across all bags (here nBags=100). In this application, a total of 83 features are used for prediction. Among them, the features that are selected repeatedly (large varImp values) are the most important ones. Here, we take the top 10 most important features, namely those selected at least 5 times in forward regression across the 100 bags. These 10 features form the basis of the RGLM interpretation. Note that users can decide how many of the most important features to keep according to their needs.

# variable importance measure
varImp = RGLM$timesSelectedByForwardRegression
sum(varImp > 0)
# [1] 83

table(varImp)
# varImp
#    0    1    2    3    4    5    6    7    8    9   10   14   15   17
# 2225   52   12    6    3    1    2    1    1    1    1    1    1    1

# select the most important features
impF = colnames(x)[varImp >= 5]
impF
# [1] "G246"  "G545"  "G566"  "G1074" "G1319" "G1327" "G1389" "G1954" "G2050"
# [10] "G2117"

4. RGLM interpretation

We build a single GLM to explain the outcome with the 10 most important features only. G566 and G1327 are negatively associated with the outcome, while the other features are positively associated with the outcome.

# build a single GLM model with the most important features
model1 = glm(y ~ ., data = as.data.frame(x[, impF]),
             family = binomial(link = 'logit'))
model1
# Coefficients:
# (Intercept)         G246         G545         G566        G1074        G1319
#    -29.2645       7.1445       5.0429      -6.9307       3.1406       0.7925
#       G1327        G1389        G1954        G2050        G2117
#     -2.1011       4.9900       9.3048       1.3402       3.8649
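As an aside (this snippet is not part of the original tutorial), the logistic regression coefficients of model1 can be exponentiated into odds ratios, which some readers find easier to interpret than log-odds. A minimal sketch, assuming model1 from above is in the workspace:

# hedged sketch: express the coefficients of model1 as odds ratios;
# values above 1 indicate a positive association with outcome "2",
# values below 1 (here G566 and G1327) a negative association
oddsRatio = exp(coef(model1))
signif(oddsRatio, 3)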
5. Compare single model prediction with original RGLM prediction

In this section, we examine how well the predictions of the above single model with the top 10 most important features correspond to the original RGLM predictions. In other words, how well does a single model pick up the signal of the RGLM ensemble? To ensure a fair comparison, we use the unbiased out-of-bag (OOB) predictions of the original RGLM.

For the single model, we should not use the above model1 directly to make predictions for the srbct data, because the features in model1 were selected based on the same data set, which would bias the prediction. Instead, we use leave-one-out (LOO) prediction.

# compare the performance of the single model with the most important features
# to that of the original RGLM

# out-of-bag prediction probabilities from RGLM
predRGLM = RGLM$predictedOOB.response[, 2]

# define a function that calculates the leave-one-out prediction of the single model
LOOlogistic = function(y, x, impF) {
  nLoops = length(y)
  predLOO = rep(NA, nLoops)
  for (ind in 1:nLoops) {
    model = glm(y[-ind] ~ ., data = as.data.frame(x[-ind, impF]),
                family = binomial(link = 'logit'))
    predLOO[ind] = predict(model, newdata = as.data.frame(x[ind, impF, drop = FALSE]),
                           type = "response")
    rm(model)
  }
  predLOO
}

# leave-one-out predictive probabilities of the single model
predLOO = LOOlogistic(y, x, impF)

# leave-one-out prediction accuracy of the single model
1 - misclassification.rate(table(y, round(predLOO)))
# [1] 0.9841

The single model LOO prediction achieves the same accuracy as RGLM, and it misclassifies the same observation that RGLM did.

# plot
library(WGCNA)
pdf("interpret.pdf")
verboseScatterplot(predLOO, predRGLM,
  xlab = paste("LOO predictive prob of a single model with", length(impF),
               "most important features"),
  ylab = "RGLM OOB predictive prob",
  cex.lab = 1.2, cex.axis = 1.2)
abline(lm(predRGLM ~ predLOO), lwd = 2)
dev.off()

The resulting figure shows the LOO predictive probabilities that observations have outcome "2" according to the single model with the 10 most important features (x-axis) against the RGLM OOB predictive probabilities (y-axis). Evidently, the single model makes very similar predictions to the original RGLM (cor=0.97, p-value=3.6*10^-39). Therefore, in this application, a single model built after RGLM feature selection achieves good prediction accuracy and is straightforward to interpret.

6. RGLM model coefficients

Users may also want to look at the RGLM model coefficients and follow up on the features with large coefficients on average. This can be done as follows.

# get coefficients of the GLM models
# check coefficients of RGLM bag 1
coef(RGLM$models[[1]])
# (Intercept)       G1954
#   -61.53254   158.88250

# create a matrix of coefficients of features across bags
nBags = length(RGLM$featuresInForwardRegression)
coefMat = matrix(0, nBags, RGLM$nFeatures)
for (i in 1:nBags) {
  coefMat[i, RGLM$featuresInForwardRegression[[i]]] = RGLM$coefOfForwardRegression[[i]]
}

# check mean coefficients of features across bags
coefMean = apply(coefMat, 2, mean)
names(coefMean) = colnames(x)
summary(coefMean)
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
# -44.67000   0.00000   0.00000   0.07888   0.00000  31.27000

coefMean[impF]
#       G246       G545       G566      G1074      G1319      G1327
#   8.122109   7.947393 -14.200134  21.349161   4.599621  18.207984
#      G1389      G1954      G2050      G2117
#   7.282620  31.269950  24.644747   7.522164

References

1. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 2001, 7(6):673–679. [http://dx.doi.org/10.1038/89044]
2. Song L, Langfelder P, Horvath S: Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinformatics 2013, 14:5. PMID: 23323760. DOI: 10.1186/1471-2105-14-5.
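Appendix: visualizing variable importance

As an optional follow-up (this snippet is not part of the original tutorial), the variable importance measure from Section 3 can be visualized with a simple barplot. A minimal sketch in base R, assuming x, varImp and impF from above are still in the workspace:

# hedged sketch: barplot of the importance measure for the 10 most important
# features, ordered by how often each feature was selected across the 100 bags
impCounts = varImp[match(impF, colnames(x))]
names(impCounts) = impF
impCounts = sort(impCounts, decreasing = TRUE)
barplot(impCounts, las = 2,
        ylab = "times selected by forward regression",
        main = "Most important RGLM features")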