Supplementary Material

Technical description of Stage 1
Denote by $D = (Y, X)$ the $n \times (p+1)$ matrix corresponding to the data set of interest, where $n$ is the number of individuals, $Y$ is the binary dependent variable and $X$ is the $n \times p$ matrix containing the $p$ predictor variables.
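The sketches in the remainder of this supplement use the following small simulated data set in R; the object names (n, p, X, y, dat) and the simulation itself are purely illustrative and are not part of the original material.

```r
## Illustrative simulated data set: n individuals, p predictors, binary response Y.
set.seed(1)
n <- 200          # number of individuals
p <- 50           # number of predictor variables
X <- as.data.frame(matrix(rnorm(n * p), nrow = n,
                          dimnames = list(NULL, paste0("V", 1:p))))
y <- factor(rbinom(n, size = 1, prob = plogis(X$V1 + X$V2)))  # binary dependent variable
dat <- data.frame(Y = y, X)                                   # plays the role of D = (Y, X)
```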
Stage 1. Iterative elimination process:
First step:
• Denote by $D_1 = D$ the initial data set and by $X_1 = X$ its matrix of predictors.
• Build a Random Forest $RF_1$ on $D_1$, that is, using all predictor variables and the response (see subsection Random Forest parameters below).
• Obtain the ranking of the predictor variables using the chosen measure of importance (see subsection Random Forest importance measures for details). Denote by $r$ the ranking vector of variables.
• Compute the out-of-bag AUC (OOB-AUC) of the Random Forest $RF_1$, namely OOB-AUC$_1$ (see subsection Random Forest prediction and AUC computation for details).
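A minimal R sketch of this first step, assuming the simulated X and y from the data sketch above; the OOB-AUC is computed here with a rank-based (Mann-Whitney) formula rather than a dedicated AUC package, and MeanDecreaseGini is taken as the importance measure stored by default.

```r
library(randomForest)

## Rank-based (Mann-Whitney) AUC of a probability vector against binary labels.
auc <- function(prob, labels) {
  pos <- labels == levels(labels)[2]
  (sum(rank(prob)[pos]) - sum(pos) * (sum(pos) + 1) / 2) / (sum(pos) * sum(!pos))
}

## First step: Random Forest on all predictor variables.
rf1 <- randomForest(x = X, y = y, ntree = 500)

## Ranking r of the variables by the chosen importance measure
## (here MeanDecreaseGini, the measure available with the default settings).
imp <- importance(rf1)[, "MeanDecreaseGini"]
r   <- names(sort(imp, decreasing = TRUE))

## OOB-AUC_1: AUC of the out-of-bag vote proportions for class "1".
oob_auc1 <- auc(rf1$votes[, "1"], y)
```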
Subsequent steps. Step j, j > 1:
• Based on the initial ranking $r$, remove a fraction (by default 20%) of the least important variables from $X_{j-1}$ and denote the resulting matrix of predictors by $X_j$.
• Denote by $D_j = (Y, X_j)$ the reduced data set.
• Build a Random Forest on $D_j$, namely $RF_j$.
• Compute the OOB-AUC of the Random Forest $RF_j$, namely OOB-AUC$_j$.
Repeat step $j$ until the number of remaining variables is less than or equal to $k_0$ (by default $k_0 = 1$).
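A compact sketch of the whole Stage 1 elimination loop under the same assumptions (simulated X and y, rank-based AUC helper as above); the wrapper name auc_rf_stage1 and its return value are illustrative, while the 20% removal fraction, the fixed initial ranking $r$ and the stopping size $k_0$ follow the description above.

```r
library(randomForest)

## Rank-based AUC helper (as in the previous sketch).
auc <- function(prob, labels) {
  pos <- labels == levels(labels)[2]
  (sum(rank(prob)[pos]) - sum(pos) * (sum(pos) + 1) / 2) / (sum(pos) * sum(!pos))
}

auc_rf_stage1 <- function(X, y, drop_frac = 0.2, k0 = 1, ntree = 500) {
  ## First step: forest on all variables, fixed initial ranking r, OOB-AUC_1.
  rf   <- randomForest(x = X, y = y, ntree = ntree)
  r    <- names(sort(importance(rf)[, "MeanDecreaseGini"], decreasing = TRUE))
  res  <- data.frame(nvars = ncol(X), oob_auc = auc(rf$votes[, 2], y))
  vars <- r
  ## Step j: keep the most important 80% (per the initial ranking r) and refit.
  while (length(vars) > k0) {
    keep <- max(k0, floor(length(vars) * (1 - drop_frac)))
    vars <- vars[seq_len(keep)]
    rf_j <- randomForest(x = X[, vars, drop = FALSE], y = y, ntree = ntree)
    res  <- rbind(res, data.frame(nvars = length(vars),
                                  oob_auc = auc(rf_j$votes[, 2], y)))
  }
  res  # OOB-AUC_j for each step of the elimination process
}

## Example: stage1 <- auc_rf_stage1(X, y)
```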
Technical description of Stage 4
For ease of notation, we illustrate this process for a 5-fold cross-validation (CV) scheme repeated 20 times.
• For $m = 1, \ldots, M = 20$, repeat a 5-fold CV process consisting of the following steps:
1. Divide the original data set into 5 subsets: $D^{(1)}, \ldots, D^{(5)}$.
2. For $j = 1, \ldots, J = 5$:
o Perform the AUC-RF feature selection on the learning data set $D \setminus D^{(j)}$.
o Let $RF_j^{*}$ denote the optimal Random Forest (after feature elimination) and $V_j$ the set of selected variables.
o Use $RF_j^{*}$ to predict the individuals in the test data set $D^{(j)}$ (see subsection Random Forest prediction and AUC computation). This provides a vector of probabilities, $\hat{p}^{(j)}$, corresponding to the proportion of trees yielding $Y = 1$.
3. Join the predictions of the 5 CV subsets, $\hat{p} = (\hat{p}^{(1)}, \ldots, \hat{p}^{(5)})$, and compute the AUC of these predictions, denoted by CV-AUC$_m$.

• Compute the mean cross-validated AUC over the $M$ repetitions, $\overline{\mathrm{CV\text{-}AUC}} = \frac{1}{M} \sum_{m=1}^{M} \mathrm{CV\text{-}AUC}_m$.

• For each variable $X_k$, compute its probability of selection as the proportion of times that it has been selected by the AUC-RF method: $P_{\mathrm{sel}}(X_k) = \frac{1}{M J} \sum_{m=1}^{M} \sum_{j=1}^{J} \mathbf{1}\{X_k \in V_j^{(m)}\}$, where $V_j^{(m)}$ denotes the set of variables selected in fold $j$ of repetition $m$.
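A sketch of this repeated cross-validation under the same assumptions (simulated X and y, rank-based AUC helper). The function auc_rf_select() is a hypothetical stand-in for the Stage 1-3 feature selection (here it simply returns all variable names); only the CV bookkeeping, CV-AUC$_m$, the mean CV-AUC and the selection probabilities are spelled out.

```r
library(randomForest)

auc <- function(prob, labels) {   # rank-based AUC, as in the earlier sketches
  pos <- labels == levels(labels)[2]
  (sum(rank(prob)[pos]) - sum(pos) * (sum(pos) + 1) / 2) / (sum(pos) * sum(!pos))
}

## Hypothetical stand-in for the AUC-RF feature selection (Stages 1-3):
## given a learning set, it returns the names of the selected variables.
auc_rf_select <- function(X, y) names(X)  # placeholder: selects every variable

M <- 20; J <- 5
cv_auc    <- numeric(M)
sel_count <- setNames(numeric(ncol(X)), names(X))

for (m in 1:M) {
  folds <- sample(rep(1:J, length.out = nrow(X)))   # random 5-fold partition
  prob  <- numeric(nrow(X))
  for (j in 1:J) {
    test <- folds == j
    vars <- auc_rf_select(X[!test, , drop = FALSE], y[!test])
    sel_count[vars] <- sel_count[vars] + 1
    rf_j <- randomForest(x = X[!test, vars, drop = FALSE], y = y[!test], ntree = 500)
    ## Proportion of trees voting Y = 1 for the test individuals of fold j.
    prob[test] <- predict(rf_j, X[test, vars, drop = FALSE], type = "prob")[, 2]
  }
  cv_auc[m] <- auc(prob, y)        # CV-AUC_m: AUC of the joined predictions
}

mean_cv_auc <- mean(cv_auc)        # mean over the M repetitions
prob_sel    <- sel_count / (M * J) # selection probability of each variable
```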
Random Forest parameters
AUC-RF uses Random Forest with the default parameters of the R package randomForest. The most relevant specifications are ntree = 500 (the number of trees in the forest is 500), mtry = $\sqrt{p}$ (the number of candidate variables considered at each node is the square root of the total number of variables in the current forest), replace = TRUE, nodesize = 1, maxnodes = NULL, importance = FALSE and norm.votes = TRUE (see the randomForest documentation for details). The out-of-bag process of AUC-RF is as in the standard use of RF: bootstrap samples are drawn with replacement, so about one third of the cases are left out of each tree. These default values can be modified when the randomForest function is called.
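For reference, a call that spells out these defaults explicitly, assuming the simulated X and y from the data sketch above; each argument can be overridden in the same call (e.g. ntree = 1000).

```r
library(randomForest)

## Defaults used by AUC-RF, written out explicitly; any of them can be changed here.
rf <- randomForest(x = X, y = y,
                   ntree      = 500,
                   mtry       = floor(sqrt(ncol(X))),  # square root of the number of variables
                   replace    = TRUE,
                   nodesize   = 1,
                   maxnodes   = NULL,
                   importance = FALSE,
                   norm.votes = TRUE)
```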