Modeling Observational Data Workshop
Mike Babyak, PhD
The Meeting of the American Psychosomatic Society
Baltimore, MD, March 2008
Key Concepts:

Observational studies and clinical trials probably agree more often than is
believed
o Biggest threat is unmeasured variable/selection bias

Statistics is a cumulative field, with much recent progress being made by way of
simulation experiments

Use all information from variables/data
o Use imputation techniques for missing data
o Generally a bad idea to make categories out of non-categorical variables
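
The cost of categorizing can be seen in a small simulation. The sketch below is illustrative only: it assumes bivariate normal data with a true correlation of 0.5 (an arbitrary choice), splits the predictor at its median, and compares the correlation with the outcome before and after the split. The function names are invented for this example. For normal data, theory predicts that a median split shrinks the correlation to roughly 80% of its original size (Cohen, 1983).

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=20000, rho=0.5, seed=1):
    """Return (r with continuous x, r with median-split x).

    rho, n, and seed are arbitrary illustration values.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    # y is built so that its population correlation with x equals rho
    ys = [rho * x + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for x in xs]
    cut = statistics.median(xs)
    x_split = [1.0 if x > cut else 0.0 for x in xs]  # "high" vs "low" group
    return pearson(xs, ys), pearson(x_split, ys)


r_cont, r_split = simulate()
print(f"continuous predictor:   r = {r_cont:.3f}")
print(f"median-split predictor: r = {r_split:.3f}")
print(f"ratio:                  {r_split / r_cont:.2f}")
```

The split version carries less information about the outcome even though nothing else about the data has changed.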

Statistical adjustment assumes:
o Parallelism (the covariate-outcome relation is the same in every group)
o Reasonable overlap among the groups' covariate distributions
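
A rough first check on overlap is to see how much of each group falls inside the range of covariate values that both groups share. The sketch below is one simple way to do that, not a formal test; the helper names and the ages are hypothetical.

```python
def common_support(group_a, group_b):
    """Range of covariate values observed in both groups, or None if none."""
    lo = max(min(group_a), min(group_b))
    hi = min(max(group_a), max(group_b))
    return (lo, hi) if lo <= hi else None


def fraction_inside(values, support):
    """Share of observations that fall within the common range."""
    if support is None:
        return 0.0
    lo, hi = support
    return sum(lo <= v <= hi for v in values) / len(values)


# Hypothetical ages for two exposure groups
treated_age = [34, 41, 45, 52, 58, 63]
control_age = [55, 60, 62, 67, 71, 76]

support = common_support(treated_age, control_age)
print("common range:", support)
print("treated inside:", fraction_inside(treated_age, support))
print("control inside:", fraction_inside(control_age, support))
```

Here only a third of the treated group and half of the control group lie in the shared age range, so any "adjusted" comparison would lean heavily on extrapolation.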

Mediation and confounding cannot be distinguished mathematically
o Depends on your theoretical causal model
o Beware that confounders can actually be mediators
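
The algebra behind this point can be checked numerically. In ordinary least squares, the change in the coefficient for X when a third variable Z is added (c minus c') always equals a times b, where a is the slope of Z on X and b is the adjusted slope of Y on Z. The arithmetic is the same whether Z is regarded as a mediator or a confounder (MacKinnon et al., 2000). The sketch below simulates data in which Z is a confounder by construction; the effect sizes, sample size, and function names are arbitrary choices for illustration.

```python
import random


def cross(us, vs):
    """Sum of centered cross-products of two equal-length lists."""
    mu, mv = sum(us) / len(us), sum(vs) / len(vs)
    return sum((u - mu) * (v - mv) for u, v in zip(us, vs))


def coefficient_change(x, z, y):
    """Return (c, c_adj, a, b) from closed-form OLS with one or two predictors."""
    sxx, szz, sxz = cross(x, x), cross(z, z), cross(x, z)
    sxy, szy = cross(x, y), cross(z, y)
    det = sxx * szz - sxz ** 2
    c = sxy / sxx                             # Y on X alone
    c_adj = (sxy * szz - szy * sxz) / det     # Y on X, adjusting for Z
    b = (szy * sxx - sxy * sxz) / det         # Y on Z, adjusting for X
    a = sxz / sxx                             # Z on X
    return c, c_adj, a, b


rng = random.Random(2008)
z = [rng.gauss(0, 1) for _ in range(500)]                 # Z causes X and Y
x = [0.6 * zi + rng.gauss(0, 1) for zi in z]
y = [0.3 * xi + 0.5 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

c, c_adj, a, b = coefficient_change(x, z, y)
print(f"c - c' = {c - c_adj:.6f}")
print(f"a * b  = {a * b:.6f}")
```

The two printed quantities agree, and they would agree just as well on data generated with Z as a mediator. Only the causal model tells the two cases apart.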

Multivariable models with observational data assume we have all the important
variables measured and available
o We rarely have enough data to support such large models
o Prespecified models are the best place to start
o Special consideration is needed when there is not enough data
 Combine predictors
 Use penalization, propensity scoring
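
How much data is "enough" is often judged with rules of thumb from the sample-size references listed later in this handout. The sketch below encodes two of them: roughly 10 to 20 units of limiting sample size per candidate predictor (for a binary outcome the limiting sample size is the smaller of the event and non-event counts; Peduzzi et al., 1996, examine 10 events per variable), and Green's (1991) N of at least 50 + 8m for the overall model or 104 + m for individual predictors in linear regression with m predictors. These are rough guides only, and the function names are made up for this example.

```python
def limiting_sample_size_binary(n_events, n_nonevents):
    """For a binary outcome, the smaller outcome category limits the model."""
    return min(n_events, n_nonevents)


def max_candidate_predictors(limiting_n, per_predictor=10):
    """Rule-of-thumb ceiling on candidate predictors (10 to 20 per predictor)."""
    return limiting_n // per_predictor


def green_minimum_n(m):
    """Green's (1991) rules of thumb for linear regression with m predictors."""
    return {"overall_model": 50 + 8 * m, "individual_predictors": 104 + m}


# Hypothetical cohort: 400 patients, 60 of whom had the event
m_limit = limiting_sample_size_binary(n_events=60, n_nonevents=340)
print(max_candidate_predictors(m_limit, per_predictor=10))  # 6
print(max_candidate_predictors(m_limit, per_predictor=15))  # 4
print(green_minimum_n(6))  # {'overall_model': 98, 'individual_predictors': 110}
```

Note that the count is of candidate predictors considered, not of those that survive into the final model.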

Unless there is an enormous amount of data, better to avoid automated variable
selection techniques

Be aware that univariate prescreening also biases the model
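
A small simulation shows why screening on the data at hand is hazardous. In the sketch below the outcome and all 20 candidate predictors are pure noise, yet each predictor is screened against the outcome at p < .05. The sample size, number of predictors, and function names are arbitrary illustration choices, and the p-value uses the Fisher z approximation rather than an exact t test.

```python
import math
import random
from statistics import NormalDist


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def p_value(r, n):
    """Approximate two-sided p-value for a correlation (Fisher z)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def share_with_false_positive(n_datasets=300, n=50, n_predictors=20,
                              alpha=0.05, seed=7):
    """Share of all-noise datasets where screening keeps at least one predictor."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_datasets):
        y = [rng.gauss(0, 1) for _ in range(n)]
        kept = 0
        for _ in range(n_predictors):
            x = [rng.gauss(0, 1) for _ in range(n)]  # unrelated to y
            if p_value(pearson(x, y), n) < alpha:
                kept += 1
        hits += kept > 0
    return hits / n_datasets


share = share_with_false_positive()
print(f"datasets with at least one 'significant' noise predictor: {share:.0%}")
```

With 20 independent tests at the .05 level, about 1 - 0.95^20, or 64%, of such datasets are expected to yield at least one "significant" predictor, and a model built from the survivors inherits that optimism.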

Better to prespecify interactions
o Poor power, but also unstable estimates
o Avoid separate within-subgroup analyses
o May want to report trends in interactions, using appropriately cautious
language

We are more often in an exploratory mode than we realize

Use “truth in advertising”
Selected references and web resources
Mike Babyak’s e-mail
michael.babyak@duke.edu
A copy of this presentation
http://www.duke.edu/~mababyak
Observational Data and Clinical Trials
http://www.epidemiologic.org/2006/11/agreement-of-observational-and.html
http://www.epidemiologic.org/2006/10/resolving-differences-of-studies-of.html
Propensity Scoring
Rubin Symposium notes
http://www.symposion.com/nrccs/rubin.htm
Rosenbaum, P.R. and Rubin, D.B. (1984). "Reducing bias in observational studies using
sub-classification on the propensity score." Journal of the American Statistical
Association, 79, pp. 516-524.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference, Cambridge University
Press.
Rosenbaum, P. R., and Rubin, D. B. (1983). The Central Role of the Propensity Score in
Observational Studies for Causal Effects. Biometrika, 70, 41-55.
Missing Data and Imputation
http://www.stat.psu.edu/~jls/mifaq.html
http://www.multiple-imputation.com/
Mediation and Confounding
MacKinnon DP, Krull JL, Lockwood CM. Equivalence of the mediation, confounding
and suppression effect. Prev Sci (2000) 1:173–81
General Modeling
A nice web tutorial on some of the concepts presented today
symptomresearch.nih.gov/chapter_8/
Harrell FE Jr. Regression modeling strategies: with applications to linear models, logistic
regression and survival analysis. New York: Springer; 2001.
Web page with many resources related to Harrell text
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RmS
Sample Size in Multivariable Models
Kelley, K. & Maxwell, S. E. (2003). Sample size for Multiple Regression: Obtaining
regression coefficients that are accurate, not simply significant. Psychological Methods,
8, 305–321.
Kelley, K. & Maxwell, S. E. (In press). Power and Accuracy for Omnibus and Targeted
Effects: Issues of Sample Size Planning with Applications to Multiple Regression. In
Handbook of Social Research Methods, J. Brannen, P. Alasuutari, and L. Bickman
(Eds.). New York, NY: Sage Publications.
Green SB. How many subjects does it take to do a regression analysis? Multivar Behav
Res 1991; 26: 499–510.
Peduzzi PN, Concato J, Holford TR, Feinstein AR. The importance of events per
independent variable in multivariable analysis, II: accuracy and precision of regression
estimates. J Clin Epidemiol 1995; 48: 1503–10
Peduzzi PN, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the
number of events per variable in logistic regression analysis. J Clin Epidemiol 1996; 49:
1373–9.
Dichotomization
Harrell’s dichotomization page
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/CatContinuous
Cohen, J. (1983) The cost of dichotomization. Applied Psychological Measurement, 7,
249-253.
MacCallum R.C., Zhang, S., Preacher, K.J., & Rucker, D.D. (2002). On the practice of
dichotomization of quantitative variables. Psychological Methods, 7(1), 19-40.
Maxwell, SE, & Delaney, HD (1993). Bivariate median splits and spurious statistical
significance. Psychological Bulletin, 113, 181-190
Royston, P., Altman, D. G., & Sauerbrei, W. (2006) Dichotomizing continuous predictors
in multiple regression: a bad idea. Statistics in Medicine, 25,127-141.
Variable Selection
Thompson B. Stepwise regression and stepwise discriminant analysis need not apply
here: a guidelines editorial. Ed Psychol Meas 1995; 55: 525–34.
Altman DG, Andersen PK. Bootstrap investigation of the stability of a Cox regression
model. Stat Med 1989; 8: 771–83.
Derksen S, Keselman HJ. Backward, forward and stepwise automated subset selection
algorithms: frequency of obtaining authentic and noise variables. Br J Math Stat Psychol
1992; 45: 265–82.
Steyerberg EW, Harrell FE, Habbema JD. Prognostic modeling with logistic regression
analysis: in search of a sensible strategy in small data sets. Med Decis Making 2001; 21:
45–56.
Cohen J. Things I have learned (so far). Am Psychol 1990; 45: 1304–12.
Roecker EB. Prediction error and its estimation for subset-selected models.
Technometrics 1991; 33: 459–68.
Univariate Pretesting and Transformation
Grambsch PM, O’Brien PC. The effects of preliminary tests for nonlinearity in
regression. Stat Med 1991; 10: 697–709.
Faraway JJ. The cost of data analysis. J Comput Graph Stat 1992; 1: 213–29.
Validation and Penalization
Steyerberg EW, Harrell FE Jr, Borsboom GJ, Eijkemans MJ, Vergouwe Y, Habbema JD.
Internal validation of predictive models: efficiency of some procedures for logistic
regression analysis. J Clin Epidemiol 2001; 54: 774–81.
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc B 1996; 58:
267–88.
Greenland S. When should epidemiologic regressions use random coefficients?
Biometrics 2000; 56: 915–21.
Moons KGM, Donders ART, Steyerberg EW, Harrell FE. Penalized maximum
likelihood estimation to directly adjust diagnostic and prognostic prediction models for
overoptimism: a clinical example. J Clin Epidemiol 2004; 57: 1262–70.
Steyerberg EW, Eijkemans MJ, Habbema JD. Application of shrinkage techniques in
logistic regression analysis: a case study. Stat Neerl 2001; 55:76-88.
Some simulation results relating to validation
http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RmS/logistic.val.pdf
Software
R software (free open source)
http://cran.r-project.org/
S-Plus software (commercial implementation of the S language, on which R is also based; Windows GUI)
http://insightful.com
SAS Macros for spline estimation
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/SasMacros
SAS code for bootstrap
ftp://ftp.sas.com/pub/neural/jackboot.sas