Strategies for Model Building: A Guideline

Prepared by Sarah Brennenstuhl
Last updated: January 14 2020
Decide on Type of Model
1. Effect Estimation or Explanatory Models (Inference) - these models are tools for:
a. Effect estimation (e.g., adjustment for predictors in experimental design),
b. Providing the basis for hypothesis testing (e.g. testing the causal relationship
between a focal variable and outcome, or understanding the effects of multiple
predictors on an outcome)
i. For these models, there is little need for parsimony; the goal is to develop an
accurate and complex model that best captures the data
2. Prediction Models - these include models with the goal of:
a. Predicting an outcome or predicting new or future observations,
i. For these models, there is a need to balance complexity/accuracy and
parsimony; predictions need to be generalizable to external populations
See Shmueli (2010) for an in-depth explanation of the distinction between model types.
Preliminary Steps
1. Variable selection
a. A priori methods of variable selection are recommended based on subject matter
knowledge (e.g. clinical experience, theoretical knowledge, empirical evidence)
b. Stepwise methods and univariate screening are not recommended; if used,
backward selection is preferred over forward selection, and stopping rules should
reflect sample size
c. Consider avoiding predictors that have: little chance of being measured reliably;
distributions that are too narrow; a high number of missing values
d. Prefer continuous predictors over categorized versions of the same variable
e. Consider roles that each of the other variables may play in the causal pathway (in
effect estimation/explanatory modeling), especially when there is a focal variable
(i.e., confounder, collider, effect modifier, mediator, etc., see Greenland, Pearl &
Robins, 1999)
i. Variables that are hypothesized to mediate the relationship between a focal
variable and the outcome should not be included in the model (mediation
should be tested using specific procedure, such as path analysis);
ii. Colliders (variables that are caused by both X and Y) should also not be included in
the model
f. Avoid overfitting the model with too many predictors (i.e., fitting a model with too
many degrees of freedom relative to the number of observations, or to the number of
events in binary models). Rules of thumb to avoid overfitting:
i. Binary outcomes - p ≤ m/10, where p is the degrees of freedom (df), m is the
number of cases in the less frequent outcome category
ii. Continuous outcomes – 10-15 observations/df
iii. Survival models – 10-15 events/df
g. Use data reduction methods if overfitting is a concern (e.g. Principal Components
Analysis [PCA], scoring a group of predictors; see Harrell, 2015, Chapter 4)
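The rules of thumb in item 1.f can be turned into a quick check of how many degrees of freedom a data set can support. The following is a minimal sketch only; the function name and arguments are hypothetical, and the limits are the rules of thumb quoted above, not hard thresholds.

```python
def max_degrees_of_freedom(outcome_type, n_obs=None, n_events=None,
                           n_nonevents=None, per_df=10):
    """Approximate upper limit on model df under the rules of thumb above.

    per_df is the number of observations (or events) required per df;
    the guideline suggests 10-15 for continuous and survival outcomes.
    """
    if outcome_type == "binary":
        # m is the number of cases in the less frequent outcome category
        m = min(n_events, n_nonevents)
        return m // 10
    if outcome_type == "continuous":
        return n_obs // per_df
    if outcome_type == "survival":
        return n_events // per_df
    raise ValueError("outcome_type must be 'binary', 'continuous' or 'survival'")


# Hypothetical study sizes, for illustration only
binary_limit = max_degrees_of_freedom("binary", n_events=80, n_nonevents=420)
continuous_limit = max_degrees_of_freedom("continuous", n_obs=300, per_df=15)
survival_limit = max_degrees_of_freedom("survival", n_events=45)
print(binary_limit, continuous_limit, survival_limit)
```

Note that the limit applies to degrees of freedom, not variables: a categorical predictor with four levels uses three df, and each interaction or nonlinear term adds more.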
2. Interactions
a. Consider carefully whether to include interactions (they must be biologically plausible)
3. Missing Data
a. Consider whether missing data are Missing Completely At Random (MCAR), Missing
At Random (MAR), or Not Missing At Random (NMAR) (see Schafer & Graham,
b. Avoid lowering sample size (statistical power) with complete case analysis; consider
multiple imputation or maximum likelihood methods when data are MAR or MCAR
(see Baraldi & Enders, 2009; Harel et al. 2018)
c. Special models (e.g. pattern mixture or selection models) are required for NMAR data
and should be compared to a model assuming MAR as a sensitivity test, because
missing data mechanisms generally cannot be tested
4. Choice of a Statistical Model
a. Could be based on prior distributional examination, but often based on maximizing
how available information is used
b. Consider if outcome observations are statistically independent – if repeated over
time or clustered in another way (e.g. patients within clinics), a method must be
selected that can accommodate non-independence
Verifying Model Assumptions
5. Linearity assumption - testing linearity in the relationship between the dependent variable and
the predictor
a. Test by respecifying the predictor as X² or categories of X, and compare model fit for
nested models using F-change or the likelihood ratio test (LRT)
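As an illustration of the nested-model comparison described in item 5, the sketch below computes the F-change statistic for adding a squared term to an ordinary least squares model. It assumes NumPy is available; the data are simulated, and the helper names `rss` and `f_change` are hypothetical.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X (X includes the intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def f_change(X_reduced, X_full, y):
    """F statistic comparing a reduced OLS model with a full model that nests it."""
    df_extra = X_full.shape[1] - X_reduced.shape[1]   # number of added terms
    df_resid = len(y) - X_full.shape[1]               # residual df of the full model
    rss_reduced, rss_full = rss(X_reduced, y), rss(X_full, y)
    return ((rss_reduced - rss_full) / df_extra) / (rss_full / df_resid)


# Simulated data with a curved relationship (an assumption for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(scale=1.0, size=200)

ones = np.ones_like(x)
X_linear = np.column_stack([ones, x])
X_quadratic = np.column_stack([ones, x, x**2])

F = f_change(X_linear, X_quadratic, y)
print(f"F-change for adding the squared term: {F:.1f} on 1 and {len(y) - 3} df")
```

The statistic is referred to an F distribution with the stated degrees of freedom; with SciPy installed, `scipy.stats.f.sf(F, 1, len(y) - 3)` gives the p-value. A large F suggests the linear specification is inadequate.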
6. Additivity assumption - Test for interaction if specified above
a. Use a “chunk test” to test for all interactions at once and compare to a model without
interactions using the LRT or F-change test (see Kleinbaum et al., 2014);
b. If one or more interactions are found to be significant, the model including those
interactions is the correct model
c. Use visualization techniques to help interpret interactions
7. Distributional assumption
a. For linear regression, check that residuals are approximately normally distributed
using visualization methods (e.g. histograms, plots); don’t use statistical testing
i. If residuals are non-normal, transform the outcome using the log, power, square
root or other transformation
b. For linear regression, check for homoscedasticity using plots; don’t use statistical
testing
i. Some transformations can work to stabilize the variance (log and square root)
or a different modeling strategy can be used (e.g. Poisson or negative
binomial model)
c. For the Cox proportional hazards model, test the proportionality assumption
8. Testing for multicollinearity
a. This is done using multivariable regression and calculation of the Variance Inflation
Factor (VIF) via Ordinary Least Squares (OLS), not bivariate correlation
b. Rather than arbitrarily removing one collinear variable, a better approach is to use a
summary score of the two collinear variables (if applicable)
c. For prediction models, multicollinearity is less of an issue
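The VIF calculation in item 8 can be sketched as follows: each predictor is regressed on all the others by OLS, and VIF = 1 / (1 - R²) from that regression. This is a minimal sketch assuming NumPy; the function name and the simulated predictors are hypothetical.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column of X (predictors only, without an intercept column)."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        # Regress predictor j on an intercept plus all remaining predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        centered = target - target.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        vifs.append(float(1.0 / (1.0 - r2)))
    return vifs


# Simulated predictors: b is nearly a copy of a, c is unrelated to both
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print([round(v, 1) for v in vifs])
```

Because each VIF comes from a multivariable regression, it can flag a predictor that is well explained by a combination of others even when no single bivariate correlation is high.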
9. Check for overly influential observations
a. Make sure data point is not erroneous!
b. There are no firm guidelines on how to treat these observations; one option is to fit
the model with and without the data point to see if it changes substantially, and
to report any important differences
c. Deleting an influential point (unless it is clearly erroneous) is very hard to justify
Final Model Evaluation
10. Selecting a Final Model
a. Multiple nested models can be used (if justified), especially when interest is in an
explanatory model of multiple predictors of an outcome and variables can be
grouped (i.e., hierarchical entry methods)
i. difference in model χ² (likelihood ratio test [LRT]) or F-change can be used to
compare two nested models statistically;
ii. AIC or BIC can be used for non-nested models; lower values are better (AIC is
also used for predictive models)
b. Do not remove non-significant predictors - “Full fit” model is the only model that
provides accurate standard errors, error mean square and p-values (Harrell, 2015)
c. For predictive modeling, use a variety of methods for model selection, including
extent of discrimination (see Harrell 2015, Chapter 4)
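For the comparisons in item 10.a, the quantities involved are simple functions of each model's maximized log-likelihood: LRT = 2(ℓ_full - ℓ_reduced), AIC = 2k - 2ℓ, and BIC = k·ln(n) - 2ℓ, where k is the number of estimated parameters and n the sample size. The sketch below uses made-up log-likelihood values purely for illustration.

```python
import math


def lrt_statistic(loglik_reduced, loglik_full):
    """Likelihood ratio statistic for two nested models."""
    return 2.0 * (loglik_full - loglik_reduced)


def aic(loglik, k):
    """Akaike Information Criterion; lower is better."""
    return 2.0 * k - 2.0 * loglik


def bic(loglik, k, n):
    """Bayesian Information Criterion; lower is better."""
    return k * math.log(n) - 2.0 * loglik


# Hypothetical fitted models (values invented for illustration)
n = 200
loglik_a, k_a = -120.5, 4   # smaller model
loglik_b, k_b = -118.9, 7   # larger model that nests model A

lrt = lrt_statistic(loglik_a, loglik_b)   # compare to chi-square with k_b - k_a df
aic_a, aic_b = aic(loglik_a, k_a), aic(loglik_b, k_b)
bic_a, bic_b = bic(loglik_a, k_a, n), bic(loglik_b, k_b, n)
print(f"LRT = {lrt:.1f} on {k_b - k_a} df")
print(f"AIC: A = {aic_a:.1f}, B = {aic_b:.1f}")
print(f"BIC: A = {bic_a:.1f}, B = {bic_b:.1f}")
```

In this invented example the three extra parameters improve the log-likelihood only slightly, so both AIC and BIC favor the smaller model. The LRT applies only because the models are nested; AIC and BIC do not require nesting.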
11. Assessing Model Performance
a. Explanatory power – R²
b. Predictive power (often assessed in conjunction with bootstrapping or an external
i. Calibration (Hosmer-Lemeshow test)
ii. Discrimination (R², model χ², Area Under the Curve [AUC]/C-statistic)
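To make the discrimination measure in item 11.b.ii concrete: for a binary outcome, the C-statistic (equal to the AUC) is the proportion of all event/non-event pairs in which the event has the higher predicted risk, with ties counted as one half. The sketch below is a direct pairwise implementation in plain Python; the function name and the toy data are hypothetical.

```python
def c_statistic(y_true, y_score):
    """Concordance (C) statistic for a binary outcome coded 0/1."""
    events = [s for y, s in zip(y_true, y_score) if y == 1]
    nonevents = [s for y, s in zip(y_true, y_score) if y == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e in events:
        for ne in nonevents:
            if e > ne:
                concordant += 1.0    # event ranked above non-event
            elif e == ne:
                concordant += 0.5    # tie counts as half
    return concordant / (len(events) * len(nonevents))


# Toy outcomes and predicted risks, for illustration only
outcomes = [0, 0, 1, 1]
predicted_risk = [0.1, 0.4, 0.35, 0.8]
c = c_statistic(outcomes, predicted_risk)
print(c)  # 3 of the 4 event/non-event pairs are concordant -> 0.75
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect separation. The pairwise loop is quadratic in sample size, so a rank-based routine is preferable for large data sets.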
12. Shrinkage
a. More often dealt with in the context of prediction models - can use bootstrapping to
determine the degree of overfitting and how much shrinkage is necessary to correct
the model’s coefficients
b. Not necessary if penalization techniques are used in the regression method
13. Model validation
a. Not necessary for effect estimation models but absolutely necessary for prediction
models, which need internal validation using resampling (i.e. bootstrapping) or
external validation
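Items 12 and 13 both rely on bootstrapping to quantify overfitting. One common way to do this is the optimism bootstrap discussed in Harrell (2015): refit the model in each bootstrap sample, compare its performance in that sample with its performance in the original data, and subtract the average difference (the optimism) from the apparent performance. The sketch below applies this to R² for an OLS model. It assumes NumPy, uses simulated data, and the function names are hypothetical; it is an illustration, not a full validation workflow.

```python
import numpy as np


def fit_ols(X, y):
    """OLS coefficients (X includes the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def r_squared(X, y, beta):
    """R-squared of fixed coefficients beta evaluated on (X, y)."""
    resid = y - X @ beta
    centered = y - y.mean()
    return float(1.0 - (resid @ resid) / (centered @ centered))


def optimism_corrected_r2(X, y, n_boot=200, seed=0):
    """Return (apparent R2, mean optimism, optimism-corrected R2)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    apparent = r_squared(X, y, fit_ols(X, y))
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        beta_b = fit_ols(X[idx], y[idx])          # refit in the bootstrap sample
        r2_boot = r_squared(X[idx], y[idx], beta_b)
        r2_orig = r_squared(X, y, beta_b)         # test bootstrap model on original data
        optimism.append(r2_boot - r2_orig)
    mean_optimism = float(np.mean(optimism))
    return apparent, mean_optimism, apparent - mean_optimism


# Simulated data: only the first of five predictors is related to the outcome
rng = np.random.default_rng(42)
n = 60
predictors = rng.normal(size=(n, 5))
y = 0.7 * predictors[:, 0] + rng.normal(size=n)
X = np.column_stack([np.ones(n), predictors])

apparent, optimism, corrected = optimism_corrected_r2(X, y)
print(f"apparent R2 = {apparent:.3f}, optimism = {optimism:.3f}, corrected = {corrected:.3f}")
```

Any model-building steps (variable selection, transformations) would need to be repeated inside each bootstrap replicate for the optimism estimate to be honest; this sketch refits a fixed model only.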
