Suppl. Material

Technical description of Stage 1

Consider the data set of interest, which comprises the binary dependent variable Y and the matrix containing the predictor variables, both observed on the individuals in the study.

Stage 1. Iterative elimination process

First step:
- Start from the initial data set, which contains all predictor variables.
- Build a Random Forest on the initial data set, that is, using all predictor variables and the response (see subsection Random Forest parameters below).
- Obtain the ranking of the predictor variables using the chosen measure of importance (see subsection Random Forest importance measures for details). Denote by r the ranking vector of variables.
- Compute the out-of-bag AUC (OOB-AUC) of this Random Forest, namely OOB-AUC1 (see subsection Random Forest prediction and AUC computation for details).

Subsequent steps. Step j, j > 1:
- Based on the initial ranking r, remove a fraction (by default 20%) of the least important variables from the current matrix of predictors, and take the result as the new matrix of predictors.
- Form the reduced data set from the response and the remaining predictors.
- Build a Random Forest on the reduced data set.
- Compute the OOB-AUC of this Random Forest, namely OOB-AUCj.

Repeat step j until the number of remaining variables is less than or equal to k0 (by default k0 = 1).

Technical description of Stage 4

For ease of notation we illustrate this process for a 5-fold cross-validation (CV) that is repeated 20 times.

- For m = 1, ..., M = 20, repeat a 5-fold CV process consisting of the following steps:
  1. Divide the original data set into 5 subsets.
  2. For j = 1, ..., J = 5:
     o Perform the AUC-RF feature selection on the learning data set (the data not in subset j).
     o Record the optimal Random Forest (after feature elimination) and the set of selected variables.
     o Use the optimal Random Forest to predict individuals in the test data set, subset j (see subsection Random Forest prediction and AUC computation). This provides a vector of probabilities, corresponding to the proportion of trees yielding Y = 1.
  3. Join the predictions of the 5 CV subsets and compute the AUC of these predictions, denoted by CV-AUCm.
- Compute the mean of CV-AUCm over the M repetitions.
- For each variable, compute its probability of selection as the proportion of times that it has been selected by the AUC-RF method across the M x J = 100 runs.

Random Forest parameters

AUC-RF uses Random Forest with the default parameters of the R package randomForest. The most relevant specifications are ntree = 500 (each forest has 500 trees), mtry = the square root of the total number of variables considered in the current forest (the number of candidate variables selected at each node), replace = TRUE, nodesize = 1, maxnodes = NULL, importance = FALSE and norm.votes = TRUE (see the randomForest documentation for details). The out-of-bag process of AUC-RF is as in the standard use of RF: bootstrap samples are drawn with replacement, so about one third of the cases are left out of each tree. These default values can be modified when the randomForest function is called.
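As an illustration of the bookkeeping in Stages 1 and 4, the following Python sketch computes the sequence of variable counts produced by the elimination process, the AUC of a vector of predicted probabilities, and the selection probability of each variable. It is not the AUC-RF implementation: the function names are hypothetical, the rule for rounding the 20% fraction (and for removing at least one variable per step) is an assumption that the text does not specify, and the AUC is computed with the standard Mann-Whitney formulation, with ties counted as one half.

```python
from collections import Counter


def elimination_schedule(n_vars, fraction=0.2, k0=1):
    """Number of variables used at each Stage 1 step, starting with all of them.

    Assumption: each step drops round(fraction * k) of the k current
    variables, and always at least one so that the loop terminates.
    """
    sizes = [n_vars]
    k = n_vars
    while k > k0:
        n_drop = max(1, round(fraction * k))
        k = max(k - n_drop, 1)  # never go below a single variable
        sizes.append(k)
    return sizes


def auc(labels, scores):
    """AUC as the probability that a case (Y = 1) scores above a control (Y = 0).

    Mann-Whitney formulation; tied pairs contribute 1/2.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def selection_probabilities(selected_sets):
    """Proportion of AUC-RF runs in which each variable was selected.

    selected_sets holds one collection of variable names per run
    (M x J runs in a repeated CV).
    """
    counts = Counter(v for s in selected_sets for v in set(s))
    return {v: c / len(selected_sets) for v, c in counts.items()}
```

Under the rounding rule assumed here, elimination_schedule(10) yields forests built on 10, 8, 6, 5, 4, 3, 2 and 1 variables. In a repeated CV, auc would be applied once per repetition to the joined predictions of the 5 test subsets, and selection_probabilities to the list of all selected variable sets.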
