Innovative statistical approaches in health services research: multiple informant analyses
Nicholas Horton
Department of Mathematics
Smith College, Northampton MA
nhorton@email.smith.edu
http://www.biostat.harvard.edu/multinform

Acknowledgements
• Joint work with Garrett Fitzmaurice and Nan Laird, Harvard School of Public Health
• Jane Murphy and the Stirling County Study for use of their example dataset
• Supported by NIH grant RO1-MH54693

Outline
• Motivation for multiple source data
• Examples of multiple sources/informants
• Models for correlated multiple source data
• Accounting for complex survey design
• Accounting for incomplete/missing data
• Example (Stirling County Study)
• Conclusions

Why multiple source data?
• to provide better measures of some underlying construct that is difficult to measure or likely to be missing
• also known as multiple informant reports, proxy reports, co-informants, etc.
• discordance is expected, otherwise there is no need to collect multiple reports

Definition of multiple source data
• data obtained from multiple informants or raters (e.g., self-reports, family members, health care providers, teachers)
• or via different/parallel instruments or methods (e.g., symptom rating scales, standardized diagnostic interviews, or clinical diagnoses)
• none of the reports is a “gold” standard
• we consider multiple source data that are commensurate (multiple measures of the same underlying variable on a similar scale)

Examples of multiple source data
• child psychopathology (ask parents, teachers and children about underlying psychological state)
• service utilization studies (collect information from subjects and databases)
• medical comorbidity (query providers and charts to assess medical problems)

Examples of multiple source data (cont.)
• adherence studies (collect self-report of adherence, electronic pill caps [MEMS] plus pharmacy records)
• nutritional epidemiology (utilize multiple dietary instruments such as food frequency questionnaires, 24-hour recalls, food diaries)

Incomplete/missing reports
• Multiple source reports are commonly incomplete since, by definition, they are collected from sources other than the primary subject of the study
• This missingness may be by design or happenstance (or both!)

Example: missing source reports
• Consider service utilization studies that collect information from subjects and databases
• Subjects may be lost to follow-up (or only contacted periodically)
• Databases may be incomplete (lack of consent, lack of appropriate coverage)

Analytic approach
• Multiple sources can provide information on outcomes or predictors (risk factors)
• Multiple source outcome: what is the prevalence of child psychopathology? (measured using parallel parent and teacher reports)
• Fitzmaurice et al (AJE, 1995), Horton et al (HSOR, 2002), Horton and Fitzmaurice (SIM tutorial, in press)

Analytic approach (cont.)
• Multiple source predictor: what are the odds of developing depression in adulthood, conditional on parallel reports of anxiety (collected from a child and a parent)?
• Examples: Horton et al (AJE, 2001), Lash et al (AJE, 2003), Liddicoat et al (JGIM, 2004), Horton and Fitzmaurice (SIM tutorial, in press)
• We will focus on an example using multiple source predictors

Notation
• Let Y denote a univariate outcome for a given subject
• Let X_l denote the l-th multiple source predictor (l = 1, ..., L)
• Let Z denote a vector of other covariates for the subject
• To simplify exposition, we consider two sources with dichotomous reports (L = 2)

Questions to consider
• Are the sources reporting on the same underlying construct (are they commensurate or interchangeable)?
• Is it possible to combine the reports in some fashion?
• How should missing reports be handled?
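
As a concrete illustration of the notation and the first and third questions above, the following is a minimal sketch, not taken from the slides, of how two dichotomous source reports might be summarized before any modelling. The toy records, the function name summarize_sources, and the use of Cohen's kappa as the agreement measure are all assumptions made for illustration; the slides do not prescribe a particular agreement statistic.

```python
# Hypothetical toy data: one tuple (Y, X1, X2, Z) per subject.
# X1 and X2 are the two dichotomous source reports; None marks a missing report.
records = [
    (1, 1, 1, 0),
    (0, 0, 0, 1),
    (1, 1, 0, 0),
    (0, 0, None, 1),
    (1, None, 1, 0),
    (0, 0, 0, 0),
    (0, 1, 1, 1),
    (1, 0, 1, 1),
]


def summarize_sources(rows):
    """Count missing reports and measure agreement between the two sources."""
    missing_x1 = sum(1 for _, x1, _, _ in rows if x1 is None)
    missing_x2 = sum(1 for _, _, x2, _ in rows if x2 is None)

    # Agreement can only be assessed for subjects with both reports observed.
    pairs = [(x1, x2) for _, x1, x2, _ in rows
             if x1 is not None and x2 is not None]
    n = len(pairs)

    observed = sum(1 for a, b in pairs if a == b) / n
    p1 = sum(a for a, _ in pairs) / n  # proportion positive, source 1
    p2 = sum(b for _, b in pairs) / n  # proportion positive, source 2
    expected = p1 * p2 + (1 - p1) * (1 - p2)  # chance agreement

    # Cohen's kappa is undefined when chance agreement equals 1.
    kappa = None if expected == 1.0 else (observed - expected) / (1 - expected)

    return {
        "n_complete_pairs": n,
        "missing_x1": missing_x1,
        "missing_x2": missing_x2,
        "observed_agreement": observed,
        "kappa": kappa,
    }


summary = summarize_sources(records)
print(summary)
```

A kappa well below 1 is consistent with the expectation that sources are discordant; it does not by itself settle whether the sources are commensurate.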

Analytic approaches
• Reviewed in Horton, Laird and Zahner (IJMPR, 1999)
• Use only one source
  f(Y | X_1, Z)
• Fit separate models
  f(Y | X_1, Z)
  f(Y | X_2, Z)

Analytic approaches (cont.)
• Combine (pool) the reports in some fashion
  X_{OR} = OR(X_1, X_2)
  f(Y | X_{OR}, Z)
• Include both reports in the model
  f(Y | X_1, X_2, Z)

Analytic approaches (cont.)
• We considered simultaneous estimation of the marginal models:
  f(Y | X_1, Z) = \beta_0^{(1)} + \beta_1^{(1)} X_1 + \beta_2^{(1)} Z
  f(Y | X_2, Z) = \beta_0^{(2)} + \beta_1^{(2)} X_2 + \beta_2^{(2)} Z
• Non-standard application of GEE
• Method independently suggested by Pepe et al (SIM, 1999)

Advantages of new approach
• can be used to test for source differences in association with the outcome
  \beta_1^{(1)} = \beta_1^{(2)}
• can test if the effects of other risk factors on the outcome differ by source
  \beta_2^{(1)} = \beta_2^{(2)}

Advantages of new approach (cont.)
• different source effects can be retained where necessary
• a pooled model can be fit if there are no significant source effects (potentially more efficient)
  f(Y | X_1, Z) = \beta_0^{(1)} + \beta_1^{(1)} X_1 + \beta_2^{(1)} Z
  f(Y | X_2, Z) = \beta_0^{(2)} + \beta_1^{(1)} X_2 + \beta_2^{(1)} Z
• can be fit using general purpose statistical software

Accounting for survey design
• Many health services or epidemiologic studies arise from complex survey samples
• Need to address stratification, multi-stage clustering and unequal sampling weights
• Failing to properly account for survey design may lead to bias and incorrect estimation of variability

Accounting for survey design (cont.)
• Estimation proceeds using the approximate (quasi) log-likelihood (weighted version of the usual score equations for a GLM, accounting for the multi-stage clustering, including multiple source reports)
• Can be fit using general purpose statistical software (e.g. Stata)

Accounting for incomplete source reports
• Missing source reports in this setting are missing predictors
• Account for MAR missingness by the weighted estimating equation methodology of Robins et al (JASA, 1994) and Xie and Paik (Biometrics, 1997)
• Adds an additional “missingness weight”
• Complicates variance estimation

Example: Stirling County
• Outcome: time to event (death) over a 16-year follow-up period (1952-1968) (n=1079)
• Multiple source predictors: partially observed dichotomous physician report or self-report of psychiatric disorder
• Other predictors: age (3 categories), gender
• Statistical model: piecewise exponential survival with 4 intervals, each of 4 years duration (subjects contribute time at risk in each interval)

Stirling County survey design
[Figure: nested design diagram showing strata (Stratum 1, ..., Stratum k, ..., Stratum K), PSUs within strata (PSU 1, ..., PSU j, ..., PSU J), and the two reports (self-report, physician report) within PSUs.]

Stirling County missingness
• Complete data on mortality
• Relatively few reports of diagnosis missing (5% physician, 7% self)
• For missing physician reports, MCAR is plausible
• Missing self-reports associated with demographics and physician report
• Accounting for missingness did not affect results (Horton et al, AJE, 2001)

Results (separate parameters)
• Initially fit model with separate parameters
• No evidence for any non-zero source terms
• Implies that the association between risk factors and mortality did not differ by source
• Dropped these terms from the model, yielding a parsimonious shared parameter model with smaller standard errors

Results (shared parameters)
Parameter (log MRR)      Estimate (SE)
female                   -0.13 (0.15)
mid-age                   2.48 (0.28)
older-age                 3.53 (0.33)
diagnosis                 1.62 (0.33)
diagnosis*mid-age        -1.35 (0.38)
diagnosis*older-age      -1.31 (0.46)

Interpretation of results (annual mortality rate)
               Age < 50    Age >= 70
Diagnosis=0    0.001       0.056
Diagnosis=1    0.007       0.093

Conclusions
• new methods of analysis of multiple source data are available
• can be implemented using existing software
• methods allow the assessment of the relative association of each source
• each source yielded similar conclusions: association between psychiatric disorder and mortality is stronger for younger subjects
• unified model has less variability, pools information after testing for systematic differences

Conclusions (cont.)
• methods account for complex survey designs
• methods allow partially observed subjects to contribute, under MAR assumptions
• multiple source reports arise in many settings (not just for children anymore!)

Future work
• Maximum-likelihood estimation instead of GEE approach
  – May yield efficiency gains
  – Particularly useful for missing reports
• Non-commensurate reports
  – Different scales
  – Different underlying constructs
  – Consider latent variable models (e.g. work of Landrum, Normand)
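
The simultaneous estimation of the two marginal models described earlier is often set up by stacking the data so that each subject contributes one row per observed source report, with a source indicator and its interactions. The sketch below shows only that restructuring step. It is an illustration under stated assumptions, not the authors' code: the records, the field names, and the function stack_sources are hypothetical, and the survey weights and missingness weights discussed in the slides are not implemented (rows with a missing report are simply omitted).

```python
# Hypothetical wide-format data: one record per subject.
# x1 and x2 are the two dichotomous source reports; None marks a missing report.
wide = [
    {"id": 1, "y": 1, "x1": 1, "x2": 1, "z": 0},
    {"id": 2, "y": 0, "x1": 0, "x2": None, "z": 1},
    {"id": 3, "y": 1, "x1": None, "x2": 1, "z": 0},
]


def stack_sources(rows):
    """Return long-format rows: one row per subject per observed source report.

    source = 0 for the first source and 1 for the second. In a model with
    terms x, z, source, source_x and source_z:
      - the coefficient on x is the source-1 effect of the report,
      - the coefficient on source_x is the source-2 minus source-1 difference,
      - the coefficient on source_z is the between-source difference in the
        effect of z.
    Testing whether the source_x and source_z coefficients are zero
    corresponds to testing for source differences.
    """
    long_rows = []
    for r in rows:
        for source, key in ((0, "x1"), (1, "x2")):
            x = r[key]
            if x is None:
                # Simplification: drop missing reports. The weighted
                # estimating equations for MAR data are not shown here.
                continue
            long_rows.append({
                "id": r["id"],        # cluster identifier (same subject)
                "y": r["y"],          # outcome repeated on each row
                "source": source,
                "x": x,
                "z": r["z"],
                "source_x": source * x,
                "source_z": source * r["z"],
            })
    return long_rows


stacked = stack_sources(wide)
for row in stacked:
    print(row)
```

The stacked rows would then be passed to software that fits a GLM with an independence working correlation and a variance estimator that treats rows from the same subject (or the same PSU) as a cluster. If the source interaction terms are dropped, the same layout yields the pooled, shared-parameter model.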