Tutorial that illustrates how to interpret the RGLM predictor.

Tutorial for the randomGLM R package:
Interpretation of the RGLM predictor
Lin Song, Steve Horvath
In this tutorial, we show how to select important features from an RGLM and how to interpret the
ensemble predictor. As an example training data set, we use the small, round blue cell tumors (srbct)
data set [1,2], which contains the expression profiles of 2308 genes across 63 observations.
The data can be found on our webpage at http://labs.genetics.ucla.edu/horvath/RGLM. No test set is
needed.
1. Data preparation
# load required package
library(randomGLM)
# download data from webpage and load it.
# Importantly, change the path to the data and use forward slashes /
setwd("C:/Users/Horvath/Documents/CreateWebpage/RGLM/Tutorials/")
load("srbct.rda")
# check data
dim(srbct$x)
table(srbct$y)
#  1  2
# 40 23
x = srbct$x
y = srbct$y
# number of features
N = ncol(x)
# define function misclassification.rate, used below for accuracy calculation
if (exists("misclassification.rate")) rm(misclassification.rate)
misclassification.rate = function(tab) {
  # proportion of off-diagonal entries in a confusion table
  num1 = sum(diag(tab))
  denom1 = sum(tab)
  signif(1 - num1/denom1, 3)
}
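As a quick sanity check (not part of the original tutorial), the function can be applied to a hypothetical 2x2 confusion table; with 7 of 10 observations on the diagonal, the misclassification rate is 0.3.
# hypothetical confusion table with 7 correct and 3 incorrect classifications
tabToy = as.table(matrix(c(4, 2, 1, 3), 2, 2))
misclassification.rate(tabToy)
# [1] 0.3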
2. RGLM prediction
First we do RGLM prediction with default parameter settings. The prediction accuracy is 0.984, with 1
out of 63 observations being misclassified.
RGLM = randomGLM(x, y,
classify=TRUE,
keepModels=TRUE)
tab1 = table(y, RGLM$predictedOOB)
#      1  2
#  1  40  0
#  2   1 22
# accuracy
1-misclassification.rate(tab1)
#[1] 0.9841
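To see which of the 63 observations was misclassified, one can compare the OOB predictions to the true labels (a minimal check, assuming predictedOOB contains class labels on the same scale as y):
# index of the misclassified observation
which(RGLM$predictedOOB != y)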
3. Feature selection
We define the variable importance measure of a feature as the number of times it is selected by
forward regression across all bags (here nBags=100). In this application, a total of 83 features were
used for prediction. Among them, the features that are selected repeatedly (large varImp values) are the
most important ones. Here, we take the 10 most important features, namely those selected at least 5
times in forward regression across the 100 bags. These 10 features form the basis of the RGLM
interpretation. Note that users can choose how many of the most important features to keep according
to their needs.
# variable importance measure
varImp = RGLM$timesSelectedByForwardRegression
sum(varImp>0)
# 83
table(varImp)
#varImp
#    0    1    2    3    4    5    6    7    8    9   10   14   15   17
# 2225   52   12    6    3    1    2    1    1    1    1    1    1    1
# select most important features
impF = colnames(x)[varImp>=5]
impF
# [1] "G246"  "G545"  "G566"  "G1074" "G1319" "G1327" "G1389" "G1954" "G2050"
#[10] "G2117"
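For reference, one can also tabulate how often each of these top features was selected across the 100 bags; a minimal sketch:
# selection counts of the top features
data.frame(feature = impF, timesSelected = varImp[varImp >= 5])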
4. RGLM interpretation
We build a single GLM that explains the outcome using only the 10 most important features. G566 and
G1327 are negatively associated with the outcome, while the other features are positively associated
with it.
# build single GLM model with most important features
model1 = glm(y~., data=as.data.frame(x[, impF]), family=binomial(link='logit'))
model1
#Coefficients:
#(Intercept)        G246        G545        G566       G1074       G1319
#   -29.2645      7.1445      5.0429     -6.9307      3.1406      0.7925
#      G1327       G1389       G1954       G2050       G2117
#    -2.1011      4.9900      9.3048      1.3402      3.8649
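Standard errors and p-values of the single model can be inspected with summary(model1). Note that with only 63 observations and 10 predictors, the logistic fit may be close to perfect separation, in which case glm() issues warnings and the standard errors become unreliable; the coefficient signs above should then be interpreted with caution.
# coefficient table with standard errors, z values and p-values
summary(model1)$coefficients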
5. Compare single model prediction with original RGLM prediction
In this section, we examine how well the predictions of the above single model, which uses the 10 most
important features, correspond to the original RGLM predictions. In other words, how well does a
single model pick up the signal of the RGLM ensemble?
To ensure a fair comparison, we use the unbiased out-of-bag (OOB) predictions of the original RGLM. For
the single model, we should not use the above model1 directly to make predictions for the srbct data,
because the features in model1 were selected based on this same data set, which would bias the
predictions. Instead, we use leave-one-out (LOO) prediction.
# compare the performance of the single model with most important features and the original RGLM
# out-of-bag prediction probabilities from RGLM
predRGLM = RGLM$predictedOOB.response[,2]
# define function to calculate leave-one-out prediction of single model
LOOlogistic = function(y, x, impF)
{
  nLoops = length(y)
  predLOO = rep(NA, nLoops)
  for (ind in 1:nLoops)
  {
    # fit the single GLM on all observations except observation ind
    model = glm(y[-ind]~., data=as.data.frame(x[-ind, impF]),
                family=binomial(link='logit'))
    # predict the left-out observation
    predLOO[ind] = predict(model, newdata=as.data.frame(x[ind, impF, drop=FALSE]),
                           type="response")
    rm(model)
  }
  predLOO
}
# leave-one-out predictive prob of single model
predLOO = LOOlogistic(y, x, impF)
# leave-one-out prediction accuracy of single model
1-misclassification.rate(table(y, round(predLOO)))
#[1] 0.9841
# Single model LOO prediction achieves the same accuracy as RGLM, and it misclassified the same observation as RGLM did.
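This can be verified directly; a small check, assuming y is coded so that as.numeric(y) - 1 gives the 0/1 outcome (true for a factor with levels "1" and "2"):
# observation misclassified by the single model (LOO)
which(round(predLOO) != as.numeric(y) - 1)
# observation misclassified by RGLM (OOB)
which(RGLM$predictedOOB != y)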
# plot
library(WGCNA)
# save the plot to a pdf file in the current working directory
pdf("interpret.pdf")
verboseScatterplot(predLOO, predRGLM,
  xlab = paste("LOO predictive prob of a single model with", length(impF),
               "most important features"),
  ylab = "RGLM OOB predictive prob",
  cex.lab = 1.2,
  cex.axis = 1.2)
abline(lm(predRGLM~predLOO), lwd=2)
dev.off()
This figure shows the LOO predictive probabilities for observations to have outcome "2" from a
single model with the 10 most important features (x-axis) plotted against the RGLM OOB predictive
probabilities (y-axis). Clearly, the single model makes predictions very similar to those of the original
RGLM (cor = 0.97, p-value = 3.6 × 10^-39). Therefore, in this application, a single model built after
RGLM feature selection achieves good prediction accuracy and is straightforward to interpret.
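The reported correlation and p-value can be recomputed directly from the two vectors of predictive probabilities:
# correlation between single-model LOO and RGLM OOB predictive probabilities
cor.test(predLOO, predRGLM)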
6. RGLM model coefficients
Users may also want to look at the RGLM model coefficients and follow up on those features with large
coefficients on average. This can be done as follows.
# get coefficients of GLM models
# check coefficients of RGLM bag 1
coef(RGLM$models[[1]])
#(Intercept)       G1954
#  -61.53254   158.88250
# create matrix of coefficients of features across bags
nBags = length(RGLM$featuresInForwardRegression)
coefMat = matrix(0, nBags, RGLM$nFeatures)
for (i in 1:nBags)
{
coefMat[i, RGLM$featuresInForwardRegression[[i]]] = RGLM$coefOfForwardRegression[[i]]
}
# check mean coefficients of features across bags
coefMean = apply(coefMat, 2, mean)
names(coefMean) = colnames(x)
summary(coefMean)
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
# -44.67000   0.00000   0.00000   0.07888   0.00000  31.27000
coefMean[impF]
#      G246       G545       G566      G1074      G1319
#  8.122109   7.947393 -14.200134  21.349161   4.599621
#     G1327      G1389      G1954      G2050      G2117
# 18.207984   7.282620  31.269950   7.522164  24.644747
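To follow up on features with large coefficients regardless of sign, one could rank them by the absolute value of the mean coefficient; a minimal sketch:
# top 10 features by absolute mean coefficient across bags
head(sort(abs(coefMean), decreasing = TRUE), 10)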
References
1. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu
CR, Peterson C, Meltzer PS: Classification and diagnostic prediction of cancers using gene expression
profiling and artificial neural networks. Nature Medicine 2001, 7(6):673-679.
[http://dx.doi.org/10.1038/89044]
2. Song L, Langfelder P, Horvath S: Random generalized linear model: a highly accurate and
interpretable ensemble predictor. BMC Bioinformatics 2013, 14:5. PMID: 23323760.
DOI: 10.1186/1471-2105-14-5.