Tutorial for the randomGLM R package: prediction with UCI machine learning benchmark data

Lin Song, Steve Horvath

1. Data preparation

Here we illustrate how to make predictions with a UCI machine learning benchmark data set. We start by loading the required libraries and preparing the data. The Ionosphere data (from the R package mlbench) contain 34 features across 351 samples and a binary outcome. In our paper [1], we averaged the 3-fold cross-validation estimates of accuracy across 100 random partitions into 3 folds. Here, however, we do not use cross-validation. Instead, we split the data into a training set (comprising 2/3 of the observations) and a test set (the remaining 1/3).

# load required libraries
library(randomGLM)
library(mlbench)
options(stringsAsFactors = FALSE)

# load data
data(Ionosphere)
# check data
head(Ionosphere)

# check outcome variable
y0 = as.factor(Ionosphere[, 35])
table(y0)
# y0
#  bad good
#  126  225

# define features
x0 = as.matrix(Ionosphere[, -35])
mode(x0) = "numeric"

# m is the total number of samples
m = nrow(x0)

# split data into 2/3 training and 1/3 test
set.seed(1)
indx = sample(1:m, ceiling(m/3))
x = x0[-indx, ]
xtest = x0[indx, ]
y = y0[-indx]
ytest = y0[indx]

2. Define accuracy measure

We define a convenience function that computes the prediction accuracy from the confusion table, using the function accuracyMeasures from the WGCNA package.

# define accuracy measure
library(WGCNA)
accuracyM = function(predicted, y) {
  accuracyMeasures(table(predicted, y))[2, 2]
}

3. RGLM prediction

Now we have the training data x (234 samples x 34 features), the training outcome y, the test data xtest (117 samples x 34 features) and the test outcome ytest. Prediction is done as follows. Make sure to set "classify=TRUE" for binary outcome prediction.

# RGLM prediction
RGLM = randomGLM(x, y, xtest, classify=TRUE, keepModels=TRUE, nThreads=1)

# test set prediction
predictedTest = RGLM$predictedTest

# accuracy
accuracyM(predictedTest, ytest)
# [1] 0.8376068

Note that the parameter nFeaturesInBag of the randomGLM function controls the number of features randomly selected into each bag. Here we use the default value, which is (1.0276 − 0.00276 × 34) × 34 ≈ 32. We encourage users to optimize this parameter by, for example, using out-of-bag (OOB) estimates of the accuracy. How to do this is described in another tutorial (see RGLMparameterTuningTutorial.docx) posted on our webpage: http://labs.genetics.ucla.edu/horvath/RGLM
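To give a flavor of OOB-based tuning, here is a minimal sketch (not the procedure from that tutorial): it refits the predictor on the training data alone for a few candidate values of nFeaturesInBag, chosen arbitrarily for illustration, and scores each fit by the accuracy of the out-of-bag predictions, which randomGLM returns in the predictedOOB component of its output.

# sketch of OOB-based tuning of nFeaturesInBag; the candidate grid is arbitrary
candidates = c(8, 16, 32)
oobAcc = rep(NA, length(candidates))
for (i in 1:length(candidates)) {
  # no test data needed; we only use the out-of-bag predictions
  fit = randomGLM(x, y, classify=TRUE, nFeaturesInBag=candidates[i], nThreads=1)
  # predictedOOB holds the out-of-bag class predictions for the training samples
  oobAcc[i] = accuracyM(fit$predictedOOB, y)
}
# inspect the OOB accuracies and pick the best candidate
data.frame(nFeaturesInBag = candidates, OOBaccuracy = oobAcc)

Since the bags are drawn at random, setting a seed (set.seed) before each call makes the comparison reproducible.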
4. RGLM prediction with interaction terms between features

When the total number of features is small (as in most UCI benchmark data sets), we recommend adding pairwise interaction terms between features [1]. This can be done by setting the parameter "maxInteractionOrder=2". The following calculation takes about 6 minutes.

# RGLM prediction with pairwise feature interactions
RGLM.inter2 = randomGLM(x, y, xtest, classify=TRUE, maxInteractionOrder=2, keepModels=TRUE, nThreads=1)

# test set prediction
predictedTest.inter2 = RGLM.inter2$predictedTest
accuracyM(predictedTest.inter2, ytest)
# [1] 0.9145299

With pairwise interaction terms, the prediction accuracy increases from 0.84 to 0.91. Users can consider higher-order interaction terms by increasing maxInteractionOrder (e.g. to 3), but the computation may be very intensive and the performance will probably not improve [1].

5. Prediction with other common predictors

In this section, we provide code for making predictions with other common predictors: random forest, recursive partitioning, linear discriminant analysis, K-nearest neighbors, support vector machine, shrunken centroids and penalized regression models.

# load required libraries
library(randomForest)
library(class)
library(rpart)
library(tree)
library(e1071)
library(pamr)
library(supclust)
library(glmnet)
library(MASS) # provides lda

# change outcome levels from bad, good to 1 and 2, so that all methods can apply
levels(y) = levels(ytest) = 1:2

# define matrix to save predicted values by different methods
method = c("RF", "RFbigmtry", "Rpart", "LDA", "DLDA", "KNN", "SVM", "SC",
           "GLMNET0", "GLMNET0.5", "GLMNET1")
predictedMat = matrix(NA, length(ytest), length(method))

# random forest
RF = randomForest(x, y, xtest, importance=F)
predictedMat[, 1] = RF$test$predicted
RFbigmtry = randomForest(x, y, xtest, mtry=ncol(x), importance=F)
predictedMat[, 2] = RFbigmtry$test$predicted

# recursive partitioning
rp1 = rpart(y~., data=data.frame(x))
predictedMat[, 3] = predict(rp1, newdata=data.frame(xtest), type="class")

# linear discriminant analysis
# feature 2 is constant in these data and is removed so that lda can be fit
lda1 = lda(y~., data=data.frame(x[, -2]), CV=F, method="moment")
predictedMat[, 4] = predict(lda1, newdata=data.frame(xtest))$class

# diagonal linear discriminant analysis
# it only takes numeric 0/1 coding
dlda1 = dlda(x, xtest, as.numeric(y)-1)
predictedMat[, 5] = dlda1 + 1

# use cross validation to determine k for knn
fold = 3
k_knn = seq(1, 21, 2)
acc_knn_cv = matrix(NA, fold, length(k_knn))
for (i in 1:fold) {
  cvIndx = sample(1:nrow(x), round(nrow(x)/fold))
  for (j in 1:length(k_knn)) {
    acc_knn_cv[i, j] = accuracyM(
      knn(train=x[-cvIndx, ], test=x[cvIndx, ], cl=y[-cvIndx], k=k_knn[j]),
      y[cvIndx])
  }
}
k_knn_best = k_knn[which.max(apply(acc_knn_cv, 2, mean))]
predictedMat[, 6] = knn(train=x, test=xtest, cl=y, k=k_knn_best)

# support vector machine
svm1 = svm(y~., data=data.frame(x))
predictedMat[, 7] = predict(svm1, newdata=data.frame(xtest), type="class")

# shrunken centroids
dat1_sc = list(x=t(x), y=y)
sc1 = pamr.train(dat1_sc)
sc_cv = pamr.cv(sc1, dat1_sc)
# if tie, take the first threshold
threshold_sc = sc_cv$threshold[which.min(sc_cv$error)]
predictedMat[, 8] = pamr.predict(sc1, newx=t(xtest), threshold=threshold_sc, type="class")

# penalized regression models
# alpha values: 0 -- ridge regression, 0.5 -- elastic net, 1 -- lasso
alpha = c(0, 0.5, 1)
cv = cv.glmnet(x, y, family="binomial")
lambda = cv$lambda.min
for (i in 1:length(alpha)) {
  model = glmnet(x, y, family="binomial", alpha=alpha[i])
  predictedMat[, 8+i] = predict(model, xtest, type="class", s=lambda)
  rm(model)
}

# modify the accuracy measure so that predictions missing a class level
# still yield a 2x2 confusion table
accuracyM = function(predicted, y) {
  accuracyMeasures(table(factor(predicted, levels=1:2), y))[2, 2]
}

# accuracy
accOther = apply(predictedMat, 2, accuracyM, ytest)
print(data.frame(method, accOther))
#       method  accOther
# 1         RF 0.8974359
# 2  RFbigmtry 0.8888889
# 3      Rpart 0.8205128
# 4        LDA 0.8119658
# 5       DLDA 0.4529915
# 6        KNN 0.7777778
# 7        SVM 0.8888889
# 8         SC 0.7863248
# 9    GLMNET0 0.8717949
# 10 GLMNET0.5 0.8547009
# 11   GLMNET1 0.8290598

Comparing these accuracies with that of RGLM.inter2 (0.91), we see that RGLM.inter2 outperforms all of the other methods on this machine learning benchmark data set.

References

1. Song L, Langfelder P, Horvath S (2013) Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinformatics 14:5. PMID: 23323760. DOI: 10.1186/1471-2105-14-5.