Tutorial for the randomGLM R package:
prediction with UCI machine learning benchmark data
Lin Song, Steve Horvath
1. Data preparation
Here we illustrate how to make predictions using the UCI machine learning benchmark data. We
start by loading the required libraries and preparing the data. The Ionosphere data (from the R
package mlbench) contain 34 features across 351 samples and one binary outcome. In our
paper [1], we averaged the 3-fold cross-validation estimates of the accuracy across 100 random
partitions into 3 folds. Here, however, we do not use cross-validation. Instead, we split the data
into training data (comprising 2/3 of the observations) and test data (1/3 of the observations).
# load required libraries
library(randomGLM)
library(mlbench)
options(stringsAsFactors=F)
# load data
data(Ionosphere)
# check data
head(Ionosphere)
# check outcome variable
y0 = as.factor(Ionosphere[,35])
table(y0)
# y0
# bad good
# 126 225
# define features
x0 = as.matrix(Ionosphere[, -35])
mode(x0) = "numeric"
# m is total number of samples
m = nrow(x0)
# split data into 2/3 training and 1/3 test
set.seed(1)
indx = sample(1:m, ceiling(m/3))
x = x0[-indx,]
xtest = x0[indx,]
y = y0[-indx]
ytest = y0[indx]
2. Define accuracy measures
# define accuracy measure
library(WGCNA)
accuracyM = function(predicted, y)
{accuracyMeasures(table(predicted, y))[2,2]}
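If WGCNA is not available, the same quantity (the fraction of correctly classified samples) can be
computed directly from the confusion table. The following is a minimal base-R sketch (the function
name accuracyBaseR is ours, not part of any package); it assumes predicted and y share the same
factor levels.
# base-R alternative: proportion of correctly classified samples
accuracyBaseR = function(predicted, y) {
  tab = table(predicted, y)
  sum(diag(tab)) / sum(tab)
}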
3. RGLM prediction
Now we have training data x (234 samples x 34 features), training outcome y, test data xtest (117
samples x 34 features) and test outcome ytest. Prediction is done as follows. Make sure to set
“classify=TRUE” for binary outcome prediction.
# RGLM prediction
RGLM = randomGLM(x, y, xtest,
classify=TRUE,
keepModels=TRUE,
nThreads=1)
# test set prediction
predictedTest = RGLM$predictedTest
# accuracy
accuracyM(predictedTest, ytest)
# [1] 0.8376068
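Besides the discrete class predictions, the fitted object also contains continuous predicted
probabilities for the test samples; in our reading of the package documentation these are stored in
the component predictedTest.response (verify the component name with names(RGLM) in your
installed version). A brief sketch:
# sketch: inspect the continuous test-set class probabilities
# (component name assumed from the package documentation; check names(RGLM))
probTest = RGLM$predictedTest.response
head(probTest)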
Note that the parameter nFeaturesInBag of the randomGLM function controls the number of
features randomly selected into each bag. Here we use the default value, which is
(1.0276 − 0.00276 × 34) × 34 ≈ 32. We encourage users to optimize this parameter, for example
by using out-of-bag (OOB) estimates of the accuracy. How to do this is described in another
tutorial (see RGLMparameterTuningTutorial.docx), which is posted on our webpage:
http://labs.genetics.ucla.edu/horvath/RGLM
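As a rough illustration of OOB-based tuning (not a substitute for the tutorial mentioned above),
one can refit the predictor for a few candidate values of nFeaturesInBag and keep the value with
the highest OOB accuracy. A minimal sketch, assuming the fitted object exposes the OOB class
predictions in the component predictedOOB and using arbitrary candidate values:
# sketch: compare OOB accuracies across candidate values of nFeaturesInBag
candidates = c(10, 20, 32)
oobAcc = sapply(candidates, function(nf) {
  fit = randomGLM(x, y, classify=TRUE, nFeaturesInBag=nf, nThreads=1)
  accuracyM(fit$predictedOOB, y)
})
candidates[which.max(oobAcc)]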
4. RGLM prediction with interaction terms between features
When the total number of features is small (as in most UCI benchmark data sets), we
recommend adding pairwise interaction terms between features [1]. This can be done by setting
the parameter maxInteractionOrder=2. The following calculation takes about 6 minutes.
# RGLM prediction with pairwise feature interactions
RGLM.inter2 = randomGLM(x, y, xtest,
classify=TRUE, maxInteractionOrder=2,
keepModels=TRUE, nThreads=1)
# test set prediction
predictedTest.inter2 = RGLM.inter2$predictedTest
accuracyM(predictedTest.inter2, ytest)
# [1] 0.9145299
With pairwise interaction terms, the prediction accuracy increases from 0.84 to 0.91. Users can
consider higher-order interaction terms by increasing maxInteractionOrder (e.g. to 3, as sketched
below), but the computation may be very intensive and the performance will probably not
improve [1].
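For reference, a third-order run would look as follows (commented out, since the computation can
take a long time):
# sketch: third-order feature interactions (computationally intensive)
# RGLM.inter3 = randomGLM(x, y, xtest, classify=TRUE,
#   maxInteractionOrder=3, keepModels=TRUE, nThreads=1)
# accuracyM(RGLM.inter3$predictedTest, ytest)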
5. Prediction with other common predictors
In this section, we provide code for making predictions with other common predictors:
random forest, recursive partitioning, linear discriminant analysis, K-nearest neighbors,
support vector machine, shrunken centroids, and penalized regression models.
# load required libraries
library(randomForest)
library(class)
library(MASS) # provides lda, used below
library(rpart)
library(tree)
library(e1071)
library(pamr)
library(supclust)
library(glmnet)
# change outcome levels from bad, good to 1 and 2, so that all
# methods can be applied
levels(y) = levels(ytest) = 1:2
# define matrix to save predicted values by different methods
method = c("RF", "RFbigmtry", "Rpart", "LDA", "DLDA", "KNN", "SVM",
"SC", "GLMNET0", "GLMNET0.5", "GLMNET1")
predictedMat = matrix(NA, length(ytest), length(method))
# random forest
RF=randomForest(x, y, xtest, importance=F)
predictedMat[, 1] = RF$test$predicted
RFbigmtry = randomForest(x, y, xtest, mtry=ncol(x), importance=F)
predictedMat[, 2] = RFbigmtry$test$predicted
# recursive partitioning
rp1 = rpart(y~., data=data.frame(x))
predictedMat[, 3] = predict(rp1, newdata=data.frame(xtest), type="class")
# linear discriminant analysis (lda from the MASS package)
# column 2 of x is constant in these data and must be dropped for lda
lda1 = lda(y~., data=data.frame(x[,-2]), CV=F, method="moment")
predictedMat[, 4] = predict(lda1, newdata=data.frame(xtest))$class
#diagonal linear discriminant analysis
# it only takes numeric 0/1 coding
dlda1 = dlda(x, xtest, as.numeric(y)-1)
predictedMat[, 5] = dlda1+1
# use cross validation to determine k for knn
fold = 3
k_knn = seq(1, 21, 2)
acc_knn_cv = matrix(NA, fold, length(k_knn))
for (i in 1:fold) {
  cvIndx = sample(1:nrow(x), round(nrow(x)/fold))
  for (j in 1:length(k_knn)) {
    acc_knn_cv[i, j] = accuracyM(
      y[cvIndx],
      knn(train=x[-cvIndx,], test=x[cvIndx,], cl=y[-cvIndx], k=k_knn[j]))
  }
}
k_knn_best = k_knn[which.max(apply(acc_knn_cv, 2, mean))]
predictedMat[, 6] = knn(train=x, test=xtest, cl=y, k=k_knn_best)
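Note that the loop above draws a fresh random subset of the training samples in each iteration,
so the three validation sets are not a disjoint partition. If a proper 3-fold partition is preferred,
a minimal variant is:
# variant: disjoint 3-fold partition for the knn cross-validation
foldId = sample(rep(1:fold, length.out=nrow(x)))
for (i in 1:fold) {
  cvIndx = which(foldId == i)
  for (j in 1:length(k_knn)) {
    acc_knn_cv[i, j] = accuracyM(
      y[cvIndx],
      knn(train=x[-cvIndx,], test=x[cvIndx,], cl=y[-cvIndx], k=k_knn[j]))
  }
}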
# support vector machine
svm1 = svm(y~., data=data.frame(x))
predictedMat[, 7] = predict(svm1, newdata=data.frame(xtest), type="class")
# shrunken centroids
dat1_sc = list(x=t(x), y=y)
sc1 = pamr.train(dat1_sc)
sc_cv = pamr.cv( sc1, dat1_sc)
# if tie, take the first threshold
threshold_sc = sc_cv$threshold[which.min(sc_cv$error)]
predictedMat[, 8] = pamr.predict(sc1, newx=t(xtest), threshold=threshold_sc, type="class")
# penalized regression models
# alpha values: 0 -- ridge regression, 0.5 -- elastic net, 1 -- lasso
alpha = c(0, 0.5, 1)
# choose lambda once (with cv.glmnet's default alpha) and reuse it for all alpha values
cv = cv.glmnet(x, y, family="binomial")
lambda = cv$lambda.min
for (i in 1:length(alpha))
{
model = glmnet(x, y, family="binomial", alpha=alpha[i])
predictedMat[, 8+i] = predict(model, xtest, type="class", s=lambda)
rm(model)
}
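Note that the loop above reuses the single lambda chosen before the loop. A variant that re-tunes
lambda separately for each alpha value is a straightforward extension; a minimal sketch:
# variant: tune lambda separately for each alpha value
for (i in 1:length(alpha)) {
  cv_i = cv.glmnet(x, y, family="binomial", alpha=alpha[i])
  model = glmnet(x, y, family="binomial", alpha=alpha[i])
  predictedMat[, 8+i] = predict(model, xtest, type="class", s=cv_i$lambda.min)
  rm(model)
}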
# redefine the accuracy measure with explicit factor levels, since the
# predictions stored in predictedMat are no longer factors
accuracyM = function(predicted, y)
{accuracyMeasures(table(factor(predicted, levels=1:2), y))[2,2]}
# accuracy
accOther = apply(predictedMat, 2, accuracyM, ytest)
print(data.frame(method, accOther))
#       method  accOther
# 1         RF 0.8974359
# 2  RFbigmtry 0.8888889
# 3      Rpart 0.8205128
# 4        LDA 0.8119658
# 5       DLDA 0.4529915
# 6        KNN 0.7777778
# 7        SVM 0.8888889
# 8         SC 0.7863248
# 9    GLMNET0 0.8717949
# 10 GLMNET0.5 0.8547009
# 11   GLMNET1 0.8290598
Comparing these accuracies with that of RGLM.inter2 (0.91), we see that RGLM.inter2
outperforms all of the other methods on this machine learning benchmark data set.
References
1. Song L, Langfelder P, Horvath S (2013) Random generalized linear model: a highly
accurate and interpretable ensemble predictor. BMC Bioinformatics 14:5. PMID: 23323760.
DOI: 10.1186/1471-2105-14-5.