Mini tutorial for the randomGLM R package
Lin Song, Steve Horvath
In this mini tutorial, we briefly show how to fit an RGLM predictor. We illustrate it using the brain
cancer gene expression data used in [1].
1. Binary outcome prediction
library(randomGLM)
# load data
data(mini)
# training data set whose columns contain the features (here genes)
x = mini$x
# Since some of the column names are not unique we change them
colnames(x) = make.names(colnames(x), unique = TRUE)
# outcome of the training data set
y = mini$yB
table(y)
Output:
y
0 1
28 27
# test data set whose columns must equal those of the training data set
xtest = mini$xtest
# make sure that the column names agree...
colnames(xtest) = make.names(colnames(xtest), unique = TRUE)
# outcome of the test data set
ytest = mini$yBtest
table(ytest)
Output:
ytest
 0  1
33 32
# Fit the RGLM predictor assuming that your computer has only 1 core (nThreads=1).
# Here we use the default values of all RGLM parameters. To learn how to choose
# different parameter values, please consult the tutorial RGLMparameterTuningTutorial.docx,
# which is posted on our webpage:
# http://labs.genetics.ucla.edu/horvath/RGLM/
RGLM = randomGLM(x, y, xtest, classify=TRUE, keepModels=TRUE, nThreads=1)
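Before looking at individual components, it can be useful to list everything the returned object contains. A minimal sketch (this inspection step is our addition, not part of the original code):
# list the components of the fitted RGLM object
names(RGLM)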
# out-of-bag prediction on the training data at the level of the response
predictedOOB = RGLM$predictedOOB
table(y, predictedOOB)
Output:
   predictedOOB
y    0  1
  0 24  4
  1  6 21
Message: The OOB prediction is wrong for 4+6=10 observations.
Thus, the OOB estimate of the error rate is (4+6)/length(y) = 10/55 = 0.181818.
The OOB estimate of the accuracy is 1 - 0.181818 = 0.818182.
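The same numbers can be obtained directly in R instead of by hand. A minimal sketch, assuming predictedOOB and y use the same 0/1 coding shown in the table above:
# OOB error rate and accuracy computed from the predictions
mean(predictedOOB != y)
mean(predictedOOB == y)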
# This is the test set prediction
predictedTest = RGLM$predictedTest
table(ytest, predictedTest)
Output:
      predictedTest
ytest  0  1
    0 28  5
    1  3 29
Message: The test set prediction is wrong for 3+5=8 observations.
Thus, the test set error rate is (3+5)/length(ytest) = 8/65 = 0.1230769.
The test set estimate of the accuracy is 1 - 0.1230769 = 0.8769231.
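Again, these quantities can be read off the confusion table programmatically. A small sketch (the helper object tabTest is our own):
# test set accuracy = correctly classified observations / total observations
tabTest = table(ytest, predictedTest)
sum(diag(tabTest)) / sum(tabTest)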
# Class probabilities in the test set.
predictedTest.response = RGLM$predictedTest.response
predictedTest.response
Output:
                             0          1
dat65_21484_21474 4.206930e-01 0.57930700
dat65_21486_21475 2.898402e-01 0.71015982
dat65_21488_21476 3.814986e-01 0.61850144
dat65_21490_21477 6.999995e-02 0.93000005
..............ETC
Message:
For each test set observation (rows), the output reports the probability of being in class 0 or class 1. By thresholding these values, one obtains the predicted class outcome reported in RGLM$predictedTest. To choose a different threshold in the randomGLM function, use the RGLM parameter thresholdClassProb (default value 0.5).
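As an illustration, a different threshold can also be applied by hand to the reported class probabilities. A minimal sketch (the threshold 0.7 and the object predictedTest.strict are our own choices, not part of the original tutorial; column 2 of predictedTest.response holds the class 1 probabilities shown above):
# call an observation class 1 only if its class 1 probability exceeds 0.7
predictedTest.strict = ifelse(predictedTest.response[, 2] > 0.7, 1, 0)
table(ytest, predictedTest.strict)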
# variable importance measures
varImp = RGLM$timesSelectedByForwardRegression
# Create a data frame that reports the variable importance measure of each feature.
datvarImp = data.frame(
  Feature = as.character(dimnames(RGLM$timesSelectedByForwardRegression)[[2]]),
  timesSelectedByForwardRegression = as.numeric(RGLM$timesSelectedByForwardRegression))
# Report the 20 most significant features
datvarImp[rank(-datvarImp[,2], ties.method="first") <= 20, ]
Output:
         Feature timesSelectedByForwardRegression
214  200839_s_at                               10
299    200986_at                                3
452  201280_s_at                                4
973  202291_s_at                                9
1141   202625_at                                7
1224 202803_s_at                                5
1285   202953_at                                3
1711   204174_at                                3
1860 204829_s_at                                4
1903   205105_at                                3
2210 208306_x_at                                6
2631   209619_at                                5
3000 212203_x_at                                4
3622 215049_x_at                                4
3781 217362_x_at                                9
4134   218217_at                                5
4145   218232_at                                9
4607 220856_x_at                                5
4626 220980_s_at                                5
4914    38487_at                                5
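The selected features can also be listed in decreasing order of importance. A small optional sketch (this sorting step is our addition):
# order the features by how often forward selection picked them
datvarImpSorted = datvarImp[order(-datvarImp[, 2]), ]
head(datvarImpSorted, 20)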
# Barplot of the importance measures for the 20 most important features
datVarImpSelected=datvarImp[rank(-datvarImp[,2],ties.method="first")<=20, ]
par(mfrow=c(1,1), mar=c(4,8,3,1))
barplot(datVarImpSelected[,2], horiz=TRUE, names.arg=datVarImpSelected[,1],
  xlab="Feature Importance", las=2, main="Most significant features for the RGLM",
  cex.axis=1, cex.main=1.5, cex.lab=1.5)
[Barplot: "Most significant features for the RGLM", x-axis "Feature Importance". The 20 horizontal bars correspond to the features listed in the table above, with 200839_s_at having the largest importance (10).]
# indices of features selected into the model in bag 1
RGLM$featuresInForwardRegression[[1]]
Output:
          X200660_at X202291_s_at X212145_at
Feature.1        119          973       2974
# Model coefficients in bag 1.
coef(RGLM$models[[1]])
Output:
 (Intercept)   X200660_at X202291_s_at   X212145_at
 7979.738216     2.002009     5.220940   -36.515803
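Since keepModels=TRUE was specified, the same inspection can be repeated for other bags. A minimal sketch (this summary is our addition): count how many coefficients, including the intercept, the model in each of the first five bags contains.
# number of coefficients per bag-level model
sapply(RGLM$models[1:5], function(model) length(coef(model)))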
2. Quantitative, continuous outcome prediction
library(randomGLM)
# load the data (they are part of the randomGLM package).
data(mini)
# prepare the training data
x = mini$x
# Since some of the column names are not unique we change them
colnames(x) = make.names(colnames(x), unique = TRUE)
y = mini$yC
# prepare the test data
xtest = mini$xtest
colnames(xtest) = make.names(colnames(xtest), unique = TRUE)
ytest = mini$yCtest
# Fit the RGLM predictor
RGLM = randomGLM(x, y, xtest, classify=FALSE, keepModels=TRUE, nThreads=1)
# out-of-bag prediction at the level of response
predictedOOB.response = RGLM$predictedOOB.response
# test set prediction
predictedTest.response = RGLM$predictedTest.response
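For a quantitative outcome, test set performance can be summarized, for example, by the correlation between observed and predicted values. A small sketch (this evaluation step is our addition, assuming predictedTest.response is the numeric vector of test set predictions returned when classify=FALSE):
# correlation and mean squared error between observed and predicted test outcomes
cor(ytest, predictedTest.response)
mean((ytest - predictedTest.response)^2)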
# variable importance measures
varImp = RGLM$timesSelectedByForwardRegression
# Barplot of the importance measures for the 20 most important features
datvarImp = data.frame(
  Feature = as.character(dimnames(RGLM$timesSelectedByForwardRegression)[[2]]),
  timesSelectedByForwardRegression = as.numeric(RGLM$timesSelectedByForwardRegression))
datVarImpSelected=datvarImp[rank(-datvarImp[,2],ties.method="first")<=20, ]
par(mfrow=c(1,1), mar=c(4,8,3,1))
barplot(datVarImpSelected[,2], horiz=TRUE, names.arg=datVarImpSelected[,1],
  xlab="Feature Importance", las=2, main="Most significant features for the RGLM",
  cex.axis=1, cex.main=1.5, cex.lab=1.5)
[Barplot: "Most significant features for the RGLM", x-axis "Feature Importance". The 20 horizontal bars correspond to the features 202087_s_at, 202283_at, 202625_at, 202957_at, 203028_s_at, 203561_at, 204007_at, 204682_at, 204787_at, 204829_s_at, 204924_at, 208146_s_at, 209166_s_at, 210314_x_at, 212588_at, 213798_s_at, 218217_at, 219607_s_at, 221698_s_at, 221729_at.]
# indices of features selected into the model of bag 1
RGLM$featuresInForwardRegression[[1]]
Output:
          X218232_at X203175_at X200986_at X216913_s_at X32128_at X208451_s_at X216231_s_at X208885_at
Feature.1       4145       1378        299         3757      4865         2227         3713       2353
          X220856_x_at X212588_at X202625_at X218831_s_at X208961_s_at X201041_s_at X201850_at X212119_at
Feature.1         4607       3155
ETC
# Model coefficients of bag 1
coef(RGLM$models[[1]])
Output:
  (Intercept)    X218232_at    X203175_at    X200986_at  X216913_s_at     X32128_at  X208451_s_at
-4.038310e+02  1.350274e-01  8.108334e-02 -3.471803e-02  2.447514e-01  8.666125e-01  7.484849e-02
 X216231_s_at    X208885_at  X220856_x_at    X212588_at    X202625_at  X218831_s_at  X208961_s_at
-4.420324e-02 -3.588560e-02  1.852245e-02 -4.221644e-01  2.107064e-01 -2.589818e-01  1.694786e-01
 X201041_s_at    X201850_at    X212119_at    X201954_at    X204682_at  X204053_x_at    X201887_at
-3.752358e-02  6.106237e-02  1.290596e-01 -2.070730e-01 -2.556034e-01  3.274343e-01 -3.154036e-01
   X200660_at  X214836_x_at    X217947_at    X211990_at    X200620_at  X202253_s_at    X202237_at
 1.046733e-01 -3.279971e-02 -8.911381e-02  9.909101e-02  2.114678e-01  4.784094e-01  1.220026e-01
 X211963_s_at  X213975_s_at  X202833_s_at    X201438_at  X218473_s_at    X208894_at  X219059_s_at
-9.914986e-02  3.551280e-02 -1.968306e-01  3.784942e-02 -7.550249e-02  2.204820e-02 -1.191214e-01
   X219505_at  X202353_s_at    X203882_at    X217748_at  X215121_x_at
 9.010647e-02 -6.963570e-02 -8.129762e-03  3.327119e-02  2.239568e-03
References
1. Song L, Langfelder P, Horvath S (2013) Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinformatics 14:5. PMID: 23323760. DOI: 10.1186/1471-2105-14-5.