Supplemental Digital Content 2.doc Steps in initial variable selection

Steps in initial variable selection:
General:
All models used the same 76 variables (as reported in Supplemental Digital
Content 1). All models were developed on 50% of the dataset and tested on
the remaining 50%. Models were generated using IBM SPSS Modeler,
version 21.
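The 50/50 development and test split can be sketched as follows. The
source does not say how the split was drawn, so this is a minimal sketch
assuming a simple random split; the function name `split_half`, the seed,
and the patient identifiers are hypothetical and are not taken from the
study or from SPSS Modeler.

```python
import random


def split_half(records, seed=0):
    """Shuffle a copy of the records and split it into two halves.

    Assumption: a simple random split; the source does not state
    whether the original split was random or stratified.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    # First half is used for model development, second half for testing.
    return shuffled[:mid], shuffled[mid:]


patients = list(range(1000))  # hypothetical patient identifiers
development, test = split_half(patients, seed=42)
```

Fixing the seed makes the split reproducible, so every model is developed
and tested on the same two halves.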
Generation of neural network (NN) models:
We used the multilayer perceptron (MLP) method to generate five different
NN models. The models differed from one another in the random seed
applied.
Generation of decision trees:
We generated three types of decision trees: Chi-squared Automatic
Interaction Detection (CHAID), a classification system that uses chi-square
tests to identify optimal cut-points; C5.0, a recursive classification system
based on entropy rules; and Classification and Regression Tree (CART), a
recursive classification system based on impurity rules.
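To illustrate the difference between entropy-based and impurity-based
splitting, the sketch below scores one candidate split with both measures.
The source names only "entropy rules" and "impurity rules"; the Gini index
is assumed here as the impurity measure because it is a common choice for
CART. The labels and the split are made-up toy data, and the function names
are hypothetical rather than part of SPSS Modeler.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels (assumed CART measure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_impurity(groups, measure):
    """Size-weighted average impurity of the child nodes of a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * measure(g) for g in groups if g)


# Toy node: 1 = readmitted, 0 = not readmitted (hypothetical data).
parent = [1] * 5 + [0] * 5
left, right = [1, 1, 1, 1, 0], [1, 0, 0, 0, 0]

# Reduction in each measure achieved by the candidate split.
information_gain = entropy(parent) - weighted_impurity([left, right], entropy)
gini_decrease = gini(parent) - weighted_impurity([left, right], gini)
```

A recursive tree builder would evaluate every candidate split this way and
keep the one with the largest reduction before splitting the child nodes
again.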
Model testing:
We tested the ability of each of the eight models (the five NN models and
the three decision trees) to discriminate between patients with and without
a readmission at two cut points: the 5% and the 10% of patients at highest
predicted risk. Two parameters were tested at each cut point for each
model: the positive predictive value (PPV, or 'hit rate'), the percentage
of patients who were actually readmitted among those classified as
high-risk for readmission; and lift, the ratio between the PPV and the
average occurrence of readmission in the population. The PPV ranged from
29% to 35% for the 10% highest risk and from 37% to 42% for the 5% highest
risk. The lift ranged from 1.8 to 2.8 across the 5% and 10% highest-risk
cut points.
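The two test parameters can be computed as in the sketch below, which
flags the top fraction of patients by predicted risk and then derives PPV
and lift as defined above. The scores and outcomes are made-up toy data,
not study results, and the function name `ppv_and_lift` is hypothetical.

```python
def ppv_and_lift(scores, outcomes, top_fraction):
    """Return (PPV, lift) for the top_fraction of highest-scored patients.

    scores   -- predicted risk per patient (higher means higher risk)
    outcomes -- 1 if the patient was readmitted, 0 otherwise
    """
    n = len(scores)
    k = max(1, round(n * top_fraction))  # flag at least one patient
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0],
                    reverse=True)
    flagged = ranked[:k]
    # PPV: share of flagged patients who were actually readmitted.
    ppv = sum(outcome for _, outcome in flagged) / k
    # Lift: PPV relative to the readmission rate in the whole population.
    base_rate = sum(outcomes) / n
    return ppv, ppv / base_rate


# Toy data: 20 patients, 4 readmissions (base rate 20%).
scores = [1 - i * 0.05 for i in range(20)]
outcomes = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1] + [0] * 9

ppv_5, lift_5 = ppv_and_lift(scores, outcomes, 0.05)
ppv_10, lift_10 = ppv_and_lift(scores, outcomes, 0.10)
```

A lift of 1.0 means the flagged group is readmitted at the population
average rate; values above 1.0 mean the model concentrates readmissions
in the flagged group.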
Model selection:
Of the eight models, we chose those with a PPV of 30% or higher (for the
10% highest risk) and a lift of 2.0 or above. Five models met these
criteria: two of the neural network models (termed NN1 and NN2) and the
three decision trees (CHAID, C5.0, and CART).
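The selection rule amounts to a threshold filter, sketched below. The
model names and metric values are hypothetical placeholders, not the
study's results, which are reported only as ranges. The source does not
say at which cut point the lift criterion was applied, so a single lift
value per model is assumed.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str
    ppv_top10: float  # PPV among the 10% highest-risk patients
    lift: float       # assumption: one lift value per model


def meets_criteria(result, min_ppv=0.30, min_lift=2.0):
    """Apply both thresholds from the selection rule."""
    return result.ppv_top10 >= min_ppv and result.lift >= min_lift


# Hypothetical values for illustration only.
results = [
    ModelResult("model_a", 0.33, 2.4),
    ModelResult("model_b", 0.29, 1.8),
    ModelResult("model_c", 0.31, 1.9),
]
selected = [r.name for r in results if meets_criteria(r)]
```

A model must pass both thresholds; here `model_c` passes the PPV
criterion but is dropped because its lift is below 2.0.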
Variable selection:
In each model we compared the 20 top-ranking variables. The ranking is
provided by the Modeler according to each variable's contribution to the
model. Variables ranked among the top 20 in three of the five models were
entered into the multivariate logistic regression model.
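The voting step can be sketched as a count of how many models place each
variable in their top ranks. This sketch reads "three of the five models"
as "at least three", which is an assumption. The variable names and
rankings are made up, and the example uses the top 2 rather than the top
20 to keep the toy data short.

```python
from collections import Counter


def select_variables(rankings, top_n=20, min_models=3):
    """Return variables in the top_n of at least min_models models.

    rankings -- dict mapping model name to its variables ordered from
                most to least important.
    """
    counts = Counter()
    for ranked_vars in rankings.values():
        # set() ensures each model casts at most one vote per variable.
        counts.update(set(ranked_vars[:top_n]))
    return sorted(v for v, c in counts.items() if c >= min_models)


# Hypothetical importance rankings for the five selected models.
rankings = {
    "NN1": ["age", "los", "sodium"],
    "NN2": ["age", "prior_adm", "los"],
    "CHAID": ["los", "age", "sodium"],
    "C5.0": ["prior_adm", "sodium", "age"],
    "CART": ["age", "los", "sodium"],
}
chosen = select_variables(rankings, top_n=2, min_models=3)
```

With these toy rankings, `age` appears in the top 2 of four models and
`los` in three, so both are kept, while `prior_adm` and `sodium` fall
short of three votes.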