Tutorial for the randomGLM R package:
prediction of gene traits
Lin Song, Steve Horvath
1. Binary outcome prediction
1.1 Data preparation and RGLM prediction
Here we use an empirical gene expression data set from the package to explain how RGLM [1]
works. We start by loading the required libraries and preparing the data. The brain cancer data set
contains a training set (55 samples across 5000 gene features) and a test set (65 samples across
5000 features), but no outcomes. We therefore use genes as outcomes instead.
# load required library
library(randomGLM)
# load data
data(brainCancer)
# check data
dim(brainCancer$train)
dim(brainCancer$test)
N = ncol(brainCancer$train)
# sample a quantitative gene trait from all N genes (features), and
# exclude it from the feature space
set.seed(1)
traitIndx = sample(1:N, 1)
y = brainCancer$train[, traitIndx]
ytest = brainCancer$test[, traitIndx]
x = brainCancer$train[, -traitIndx]
xtest = brainCancer$test[, -traitIndx]
# Since some of the column names are not unique we change them
colnames(x) = make.names(colnames(x), unique = TRUE)
colnames(xtest) = make.names(colnames(xtest), unique = TRUE)
# generate binary outcomes by dichotomizing continuous outcomes at the
# median. Binary outcomes can be factors or numbers.
y = as.factor(ifelse(y>median(y), 1, 0))
ytest = as.factor(ifelse(ytest>median(ytest), 1, 0))
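As a quick sanity check, one can verify that the dichotomization yields roughly balanced classes; a minimal sketch:
# check class balance of the binary outcomes
table(y)
table(ytest)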
In the code above, we randomly select a gene from all 5000 genes as the outcome. In practice, users
can change the random seed to generate different gene traits, and thus different data sets to play with.
Now we have training data x, training outcome y, test data xtest, and test outcome ytest.
Prediction is done as follows. Make sure to set “classify=TRUE” for binary outcome prediction.
This example should take less than 2 minutes.
# RGLM prediction using default parameter choices. To learn how to
# change parameter values, look at the tutorial
# RGLMparameterTuningTutorial.docx.
# nThreads= 1 assumes that the computer has only 1 core.
RGLM = randomGLM(x,y,xtest,classify=TRUE, keepModels=TRUE, nThreads=1)
# the following alternative code could be faster if your machine has 6 cores or more
#RGLM = randomGLM(x,y,xtest,classify=TRUE, keepModels=TRUE, nThreads=6)
Comment: Here we ignore the warning messages. The warnings indicate that there were problems
fitting some of the underlying GLM models.
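If the warnings clutter the console, they can be silenced with base R's suppressWarnings; a minimal sketch:
# Optional: run the same prediction while suppressing the GLM fitting warnings.
RGLM = suppressWarnings(
  randomGLM(x, y, xtest, classify=TRUE, keepModels=TRUE, nThreads=1))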
1.2 Predictor outputs
# check out-of-bag prediction in the training data
predictedOOB = RGLM$predictedOOB
table(predictedOOB, y)
#             y
# predictedOOB  0  1
#            0 24  6
#            1  4 21
# check test prediction
predictedTest = RGLM$predictedTest
table(predictedTest, ytest)
#              ytest
# predictedTest  0  1
#             0 28  3
#             1  5 29
# test set prediction can be obtained by the following as well; it only
# works when setting keepModels=TRUE in the randomGLM function
predictedTest = predict(RGLM, newdata = xtest, type="class")
Accuracy can be calculated by hand (see the sketch after the output below) or using the
accuracyMeasures function in the WGCNA R package. Here, the test set prediction accuracy is
0.88, sensitivity is 0.91, and specificity is 0.85.
## test set accuracy measures
library(WGCNA)
# Note: some computer systems support multi-threading, but it is not
# enabled within WGCNA in R. To allow multi-threading within WGCNA with
# all available cores, use
allowWGCNAThreads()
# If you get an error message, please ignore it.
out = accuracyMeasures(table(predictedTest==0, ytest==0))
out
#                    Measure     Value
#1                Error.Rate 0.1230769
#2                  Accuracy 0.8769231
#3               Specificity 0.8484848
#4               Sensitivity 0.9062500
#5   NegativePredictiveValue 0.9032258
#6   PositivePredictiveValue 0.8529412
#7         FalsePositiveRate 0.1515152
#8         FalseNegativeRate 0.0937500
#9                     Power 0.9062500
#10  LikelihoodRatioPositive 5.9812500
#11  LikelihoodRatioNegative 0.1104911
#12           NaiveErrorRate 0.4923077
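For comparison, the same numbers can be computed by hand from the confusion matrix; a minimal sketch (assuming predictedTest and ytest from above, and treating outcome level "1" as the positive class):
# accuracy measures by hand from the test set confusion matrix
confusion = table(predictedTest, ytest)
accuracy    = sum(diag(confusion)) / sum(confusion)        # (28+29)/65
sensitivity = confusion["1", "1"] / sum(confusion[, "1"])  # 29/32
specificity = confusion["0", "0"] / sum(confusion[, "0"])  # 28/33
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 2)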
Users can extract the GLMs constructed in each bag. For example, the following code indicates that
the 119th, 973rd and 2974th features were selected into the model of bag 1, and that the model is

ln(p / (1 − p)) = 7979.74 + 2.00 · Feature119 + 5.22 · Feature973 − 36.52 · Feature2974.
# indices of features selected into the model of bag 1
RGLM$featuresInForwardRegression[[1]]
# Output:
#           X200660_at X202291_s_at X212145_at
# Feature.1        119          973       2974
# Model coefficients of bag 1.
coef(RGLM$models[[1]])
# Output:
#  (Intercept)   X200660_at X202291_s_at   X212145_at
#  7979.738216     2.002009     5.220940   -36.515803
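To see how these coefficients define the bag-1 model, one can evaluate the logistic formula directly. This is a minimal sketch, assuming the default setting in which each model term is a single feature (no interaction terms); it simply applies the fitted formula to the training features:
# evaluate the bag-1 logistic model ln(p/(1-p)) = b0 + sum(b_j * feature_j)
# on the training data to recover fitted probabilities
b = coef(RGLM$models[[1]])
featureNames = names(b)[-1]            # features selected into this bag's model
linPred = b[1] + as.matrix(x[, featureNames]) %*% b[-1]
p = 1 / (1 + exp(-linPred))            # inverse logit
head(round(p, 3))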
We recommend the following variable importance measure for RGLM: the number of times each
feature is selected into a GLM across bags. In this example, we use the default total number of
bags (100). Most genes (4830 out of 4999) were not used for prediction in any bag. One gene was
selected 10 times across the 100 bags, and 3 genes were each selected 9 times. These genes have
the highest variable importance, and they contribute most to outcome prediction.
# check variable importance
varImp = RGLM$timesSelectedByForwardRegression
table(varImp)
# varImp
#    0    1    2    3    4    5    6    7    9   10
# 4830  108   37    8    4    6    1    1    3    1
# number of features used for prediction
sum(varImp>0)
# [1] 169
# Create a data frame that reports the variable importance measure of
# each feature.
datvarImp = data.frame(
  Feature = as.character(dimnames(RGLM$timesSelectedByForwardRegression)[[2]]),
  timesSelectedByForwardRegression = as.numeric(RGLM$timesSelectedByForwardRegression))
# Report the 20 most significant features
datvarImp[rank(-datvarImp[,2], ties.method="first") <= 20, ]
# Barplot of the importance measures for the 20 most important features
datVarImpSelected = datvarImp[rank(-datvarImp[,2], ties.method="first") <= 20, ]
par(mfrow=c(1,1), mar=c(4,8,3,1))
barplot(datVarImpSelected[,2], horiz=TRUE,
  names.arg=datVarImpSelected[,1], xlab="Feature Importance", las=2,
  main="Most significant features for the RGLM",
  cex.axis=1, cex.main=1.5, cex.lab=1.5)
[Figure: horizontal barplot "Most significant features for the RGLM". The y-axis lists the 20 most significant probes (38487_at, 220980_s_at, 220856_x_at, ..., 200986_at, 200839_s_at); the x-axis shows "Feature Importance" on a scale from 0 to 10.]
1.3 RGLM thinning
Previously, RGLM achieved a prediction accuracy of 0.88 using 169 features. Here we want to
reduce the number of features used for prediction while retaining good predictive accuracy. RGLM
thinning does this: the procedure removes features with small variable importance and keeps only
the important features for prediction. As shown in the following, if we set the thinning threshold to
1, only the 61 features with variable importance > 1 are kept for prediction, yet the prediction
accuracy remains almost the same (0.86).
threshold = 1
# thinned RGLM
thinRGLM = thinRandomGLM(RGLM, threshold)
# number of features remained in predictor
thinN = sum(thinRGLM$timesSelectedByForwardRegression>0)
thinN
# [1] 61
# test set prediction after thinning
predicted = predict(thinRGLM, newdata = xtest, type = "class")
accuracyMeasures(table(predicted, ytest))
#                    Measure     Value
#1                Error.Rate 0.1384615
#2                  Accuracy 0.8615385
#3               Specificity 0.9062500
#4               Sensitivity 0.8181818
#5   NegativePredictiveValue 0.8285714
#6   PositivePredictiveValue 0.9000000
#7         FalsePositiveRate 0.0937500
#8         FalseNegativeRate 0.1818182
#9                     Power 0.8181818
#10  LikelihoodRatioPositive 8.7272727
#11  LikelihoodRatioNegative 0.2006270
#12           NaiveErrorRate 0.4923077
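To see how little thinning changes the actual predictions, one can compare the thinned and full test set predictions directly; a minimal sketch (assuming predictedTest from section 1.2 is still in the workspace):
# cross-tabulate full vs. thinned test set predictions and compute agreement
table(full = predictedTest, thinned = predicted)
mean(predictedTest == predicted)   # proportion of test samples with identical predictions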
2. Continuous outcome prediction
We still use the same brain cancer data set for illustration. Note that the gene trait is not
dichotomized this time, and we set “classify=FALSE” in the randomGLM function.
# start a new R session
# load required library
library(randomGLM)
# load data
data(brainCancer)
# check data
dim(brainCancer$train)
dim(brainCancer$test)
N = ncol(brainCancer$train)
# sample a quantitative gene trait from all N genes (features) and
# exclude it from the feature space
set.seed(1)
traitIndx = sample(1:N, 1)
y = brainCancer$train[, traitIndx]
ytest = brainCancer$test[, traitIndx]
x = brainCancer$train[, -traitIndx]
xtest = brainCancer$test[, -traitIndx]
# RGLM prediction with 1 thread.
# Consider choosing nThreads=NULL or a larger integer to speed it up.
RGLM=randomGLM(x,y, xtest, classify=FALSE, keepModels=TRUE,nThreads=1)
# define the out-of-bag prediction
predictedOOB = RGLM$predictedOOB.response
# What is the OOB estimate of the accuracy? Recall that for a
# continuous trait, the accuracy is defined as the correlation between
# the OOB prediction and the true, observed training set outcome.
as.numeric(cor(predictedOOB, y, use="p"))
# [1] 0.8712458
# define the test set prediction
predictedTest = RGLM$predictedTest.response
# What is the test set estimate of the accuracy? Recall that for a
# continuous trait, the accuracy is defined as the correlation between
# the test set prediction and the true, observed test set outcome.
as.numeric(cor(predictedTest, ytest, use="p"))
# [1] 0.870373
Both the OOB estimate and the test set estimate of the prediction accuracy show that the RGLM
predictor is quite accurate (both correlations are around 0.87).
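To visualize this agreement, one can plot the test set predictions against the observed trait values; a minimal sketch:
# scatter plot of observed vs. predicted gene trait values in the test set
plot(ytest, predictedTest,
  xlab = "Observed gene trait (test set)",
  ylab = "RGLM prediction",
  main = "Continuous outcome prediction")
abline(0, 1, lty = 2)   # dashed reference line: perfect prediction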
References
1. Song L, Langfelder P, Horvath S (2013) Random generalized linear model: a highly accurate
and interpretable ensemble predictor. BMC Bioinformatics 14:5. PMID: 23323760. DOI:
10.1186/1471-2105-14-5.