Prepared by Sarah Brennenstuhl
Last updated: January 14, 2020

Strategies for Model Building

Decide on Type of Model

1. Effect Estimation or Explanatory Models (Inference) – these models are tools for:
   a. Effect estimation (e.g., adjustment for predictors in experimental design);
   b. Providing the basis for hypothesis testing (e.g., testing the causal relationship between a focal variable and an outcome, or understanding the effects of multiple predictors on an outcome)
      i. For these models there is little need for parsimony; the goal is to develop an accurate, complex model that best captures the data
2. Prediction Models – these include models with the goal of:
   a. Predicting an outcome or predicting new or future observations
      i. For these models there is a need to balance complexity/accuracy against parsimony – predictions need to generalize to external populations

See Shmueli (2010) for an in-depth explanation of the distinction between model types.

Preliminary Steps

1. Variable selection
   a. A priori methods of variable selection, based on subject matter knowledge (e.g., clinical experience, theoretical knowledge, empirical evidence), are recommended
   b. Step-wise methods and univariate screening are not recommended; if used, backward selection is preferred over forward selection, and stopping rules should reflect sample size
   c. Consider avoiding predictors that have little chance of being measured reliably, distributions that are too narrow, or a high number of missing values
   d. Prefer continuous predictors over categorized versions of the same variable (avoid dichotomizing continuous measures)
   e. Consider the role that each of the other variables may play in the causal pathway (in effect estimation/explanatory modeling), especially when there is a focal variable (i.e., confounder, collider, effect modifier, mediator, etc.; see Greenland, Pearl & Robins, 1999)
      i. Variables hypothesized to mediate the relationship between a focal variable and the outcome should not be included in the model (mediation should be tested using a specific procedure, such as path analysis)
      ii. Colliders (variables that are caused by both X and Y) should also not be included in the model
   f. Avoid overfitting the model with too many predictors (overfitting occurs when a model is fitted with too many degrees of freedom relative to the number of observations, or to the number of events in binary models). Rules of thumb to avoid overfitting:
      i. Binary outcomes – p ≤ m/10, where p is the degrees of freedom (df) and m is the number of cases in the less frequent outcome category
      ii. Continuous outcomes – 10–15 observations/df
      iii. Survival models – 10–15 events/df
   g. Use data reduction methods if overfitting is a concern (e.g., Principal Components Analysis [PCA], or scoring a group of predictors; see Harrell, 2015, Chapter 4)
2. Interactions
   a. Consider carefully which interactions, if any, to include (they must be biologically plausible)
3. Missing Data
   a. Consider whether missing data are Missing Completely At Random (MCAR), Missing At Random (MAR), or Not Missing At Random (NMAR) (see Schafer & Graham, 2002)
   b. Avoid lowering sample size (statistical power) with complete case analysis; consider multiple imputation or maximum likelihood methods when data are MAR or MCAR (see Baraldi & Enders, 2010; Harel et al., 2018) – a worked sketch appears at the end of this section
   c. Special models (e.g., pattern mixture or selection models) are required for NMAR data; because missing data mechanisms generally cannot be tested, these should be compared to a model assuming MAR as a sensitivity analysis
4. Choice of a Statistical Model
   a. The choice may be based on a prior distributional examination, but is often based on maximizing how the available information is used
   b. Consider whether outcome observations are statistically independent – if they are repeated over time or clustered in another way (e.g., patients within clinics), a method must be selected that can accommodate non-independence
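The following Python sketch illustrates one way to carry out multiple imputation under an MAR assumption (item 3b), using scikit-learn's IterativeImputer (a chained-equations method) and pooling the point estimates across imputed datasets. The file name and variable names (age, bmi, severity, outcome) are hypothetical, and full Rubin's-rules pooling of the standard errors is noted in a comment but not implemented.

```python
# A minimal sketch of multiple imputation under an MAR assumption, using
# scikit-learn's IterativeImputer (chained equations). The dataset, file
# name, and variable names are hypothetical.
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.read_csv("study_data.csv")            # hypothetical data with missing values
cols = ["age", "bmi", "severity", "outcome"]  # hypothetical variables

# Create several completed datasets; sample_posterior=True draws imputations
# from the predictive distribution, so they vary across datasets.
estimates = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(df[cols]), columns=cols)
    X = sm.add_constant(completed[["age", "bmi", "severity"]])
    estimates.append(sm.OLS(completed["outcome"], X).fit().params)

# Pool the point estimates by averaging (full Rubin's rules would also
# combine within- and between-imputation variances for the standard errors).
pooled = pd.concat(estimates, axis=1).mean(axis=1)
print(pooled)
```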
Verifying Model Assumptions

5. Linearity assumption – testing linearity in the relationship between the dependent variable and predictors
   a. Test by respecifying the predictor as X² or as categories of X, and compare model fit for the nested models using the F-change test or the likelihood ratio test (LRT)
6. Additivity assumption – test for interactions if specified above
   a. Use a "chunk test" to test all interactions at once, comparing to a model without interactions using the LRT or F-change test (see Kleinbaum et al., 2014)
   b. If one or more interactions are found to be significant, the model including the interaction(s) is the correct model
   c. Use visualization techniques to help interpret interactions
7. Distributional assumption
   a. For linear regression, check that residuals are approximately normally distributed using visualization methods (e.g., histograms, plots); don't use statistical testing
      i. If residuals are non-normal, transform the outcome using the log, power, square root, or another transformation
   b. For linear regression, check for homoscedasticity using plots; don't use statistical testing
      i. Some transformations (log, square root) can stabilize the variance, or a different modeling strategy can be used (e.g., a Poisson or negative binomial model)
   c. For the Cox proportional hazards model, test the proportionality assumption
8. Testing for multicollinearity
   a. This is done using multivariable regression and calculation of the Variance Inflation Factor (VIF) via Ordinary Least Squares (OLS), not bivariate correlation – a worked sketch appears after the Final Model Evaluation section
   b. Rather than arbitrarily removing one collinear variable, a better approach is to use a summary score of the two collinear variables (if applicable)
   c. For prediction models, multicollinearity is less of an issue
9. Check for overly influential observations
   a. Make sure the data point is not erroneous!
   b. There are no firm guidelines on how to treat these observations; one option is to fit the model with and without the data point to see whether the model changes substantially, and if there are important differences, this should be reported
   c. Deletion of an influential point (unless it is clearly erroneous) is very hard to justify

Final Model Evaluation

10. Selecting a Final Model
    a. Multiple nested models can be used (if justified), especially when interest is in an explanatory model of multiple predictors of an outcome and variables can be grouped (i.e., hierarchical entry methods)
       i. The difference in model χ² (likelihood ratio test [LRT]) or the F-change test can be used to compare two nested models statistically (see the model-comparison sketch after this section)
       ii. AIC and BIC can be used for non-nested models; lower values are better (AIC is also used for predictive models)
    b. Do not remove non-significant predictors – the "full fit" model is the only model that provides accurate standard errors, error mean square, and p-values (Harrell, 2015)
    c. For predictive modeling, use a variety of methods for model selection, including extent of discrimination (see Harrell, 2015, Chapter 4)
11. Assessing Model Performance
    a. Explanatory power – R²
    b. Predictive power (often assessed in conjunction with bootstrapping or an external dataset):
       i. Calibration (Hosmer-Lemeshow test)
       ii. Discrimination (R², model χ², Area Under the Curve [AUC]/C-statistic)
12. Shrinkage
    a. More often dealt with in the context of prediction models – bootstrapping can be used to determine the degree of overfitting and how much shrinkage is necessary to correct the model's coefficients (see the bootstrap sketch after this section)
    b. Not necessary if penalization techniques are used in the regression method
13. Model validation
    a. Not necessary for effect estimation models, but absolutely necessary for prediction models: internal validation using resampling (i.e., bootstrapping) or external validation is needed for prediction models
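The first sketch below illustrates the VIF calculation described in item 8a, computed via OLS using statsmodels; the file name and predictor names are hypothetical.

```python
# A minimal sketch of checking multicollinearity with the Variance Inflation
# Factor (item 8a); dataset and predictor names are hypothetical.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("study_data.csv")  # hypothetical dataset
X = sm.add_constant(df[["age", "bmi", "severity"]])

# VIF for each predictor, computed from an OLS regression of that predictor
# on the others (the constant at index 0 is skipped). Values above roughly
# 5-10 are a common, though arbitrary, flag for problematic collinearity.
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```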
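The second sketch illustrates the nested-model comparison in item 10a and the performance measures in item 11 for a binary outcome: a likelihood ratio test between two nested logistic models, AIC as a comparison measure, and the AUC/C-statistic for discrimination. The model formulas and variable names are hypothetical.

```python
# A minimal sketch of a likelihood ratio test between nested logistic models,
# plus AIC and AUC/C-statistic checks (items 10-11); names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats
from sklearn.metrics import roc_auc_score

df = pd.read_csv("study_data.csv")  # hypothetical data with a binary outcome

# Nested logistic models: the reduced model omits one block of predictors.
reduced = smf.logit("outcome ~ age + sex", data=df).fit(disp=False)
full = smf.logit("outcome ~ age + sex + bmi + severity", data=df).fit(disp=False)

# LRT: twice the log-likelihood difference, chi-square distributed with
# df equal to the number of extra parameters in the full model.
lr_stat = 2 * (full.llf - reduced.llf)
extra_df = full.df_model - reduced.df_model
print("LRT p-value:", stats.chi2.sf(lr_stat, extra_df))

# AIC can also be compared for non-nested models (lower is better).
print("AIC reduced vs. full:", reduced.aic, full.aic)

# Discrimination of the full model: AUC/C-statistic on the fitted data.
print("AUC:", roc_auc_score(df["outcome"], full.predict(df)))
```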
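The third sketch illustrates bootstrap internal validation (items 12-13): estimating the optimism of the apparent AUC and subtracting it, along the general lines described in Harrell (2015). It is a simplified sketch with hypothetical variable names, not a full validation procedure.

```python
# A minimal sketch of bootstrap optimism correction for the AUC (items 12-13);
# the dataset and model formula are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import roc_auc_score

df = pd.read_csv("study_data.csv")  # hypothetical data with a binary outcome
formula = "outcome ~ age + sex + bmi + severity"

# Apparent performance: the model evaluated on the same data it was fit to.
apparent_fit = smf.logit(formula, data=df).fit(disp=False)
apparent_auc = roc_auc_score(df["outcome"], apparent_fit.predict(df))

rng = np.random.default_rng(0)
optimism = []
for _ in range(200):  # number of bootstrap replicates
    idx = rng.integers(0, len(df), size=len(df))
    boot = df.iloc[idx]
    fit = smf.logit(formula, data=boot).fit(disp=False)
    # Optimism = performance on the bootstrap sample minus the performance of
    # the same fitted model applied back to the original data.
    auc_boot = roc_auc_score(boot["outcome"], fit.predict(boot))
    auc_orig = roc_auc_score(df["outcome"], fit.predict(df))
    optimism.append(auc_boot - auc_orig)

print("Apparent AUC:", round(apparent_auc, 3))
print("Optimism-corrected AUC:", round(apparent_auc - float(np.mean(optimism)), 3))
```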
Please cite the above information properly! Get the original sources below:

References

Baraldi, A. N., & Enders, C. K. (2010). An Introduction to Modern Missing Data Analyses. Journal of School Psychology, 48:5-37.

Greenland, S., Pearl, J., & Robins, J. M. (1999). Causal Diagrams for Epidemiologic Research. Epidemiology, 10:37-48.

Harel, O., et al. (2018). Multiple Imputation for Incomplete Data in Epidemiologic Studies. American Journal of Epidemiology, 187:576-584.

Harrell, F. E. Jr., Lee, K. L., & Mark, D. B. (1996). Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Statistics in Medicine, 15:361-387.

Harrell, F. E. Jr. (2015). Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Second Edition. Springer: Switzerland.

Kleinbaum, D., Kupper, L., Nizam, A., & Rosenberg, E. S. (2014). Applied Regression Analysis and Other Multivariable Methods. Fifth Edition. Cengage Learning: Boston, MA.

Schafer, J. L., & Graham, J. W. (2002). Missing Data: Our View of the State of the Art. Psychological Methods, 7:147-177.

Shmueli, G. (2010). To Explain or to Predict? Statistical Science, 25:289-310.

Vittinghoff, E., Glidden, D. V., Shiboski, S. C., & McCulloch, C. E. (2005). Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. Springer: New York.