Advanced Analysis of Complex Survey Data
Part I
Julia L. Bienias
Presented at ARM 2009

Overview
• Motivation and Example
• Point Estimation
  – Design vs. Model-Based
  – Pseudo maximum-likelihood
• Variance Estimation
  – Taylor Series/Linearization
  – Jackknife
• Model Fit and Model Checking
  – Partial residual plots
  – Added variable plots

J. Bienias & M. Elliott ARM 2009: Advanced Analysis of Complex Survey Data

Why Health Surveys?
• Representative information at reasonable cost
• Some Examples
  – Blood Lead in Children: 37% drop in high levels in 1976-80. Mahaffey et al. (1982) NEJM
  – Children's Growth Curves: 12 million charts. Hamill et al. (1979) AJCN
  – Rise in Cesarean Section Rate 1970-78: repeat C-section not necessary. Placek & Taffel (1980) Public Hlth Rep
  – The Community Intervention Trial for Smoking Cessation. COMMIT Research Group (1995) AJPH

Example: NHANES
• Large-Scale Health Survey: The Third National Health and Nutrition Examination Survey (NHANES III)
• $100 million household/medical examination survey
• ~40,000 individuals
• Surveyed 1988-1994
• Objectives:
  – (1) Estimate national prevalence of diseases/risk factors
  – (2) National population reference distributions of health measures
  – (3) Secular trends in diseases/risk factors
  – (4) Disease etiology
  – (5) Natural history of diseases

Example: NHANES, con.
• Multistage stratified cluster sample design
• 2812 Primary Sampling Units (PSUs): 13 certainty PSUs and 2799 PSUs divided into 34 strata
• 1st stage: 2 PSUs sampled from each of the 34 strata (68 PSUs), plus the 13 certainty PSUs
  • Counties oversampled if highly populated, large % African-American or large % Mexican-American.
• 2nd stage: City/suburban blocks or contiguous rural areas (called segments) randomly sampled
  • Blocks/areas with large minority pop. oversampled.
• 3rd stage: Households sampled
  • Rate depended on racial and ethnic makeup.
• 4th stage: Individuals sampled
  • Rate depended on sex, age, race/ethnicity.

Sample Weights
• Sampling individuals at different rates means sampled individuals represent different numbers of persons in the population.
• For a surveyed person, his/her sample weight w_i estimates the number of persons he/she represents, e.g., w_i = 12,302.

Sample Weights, con.
• Basic weight is equal to the reciprocal of the probability of selection π_i: w_i = 1/π_i.
• Adjustments for differential nonresponse and undercoverage are often also part of w_i.
• Public-use data files contain w_i and codes for PSUs and strata.

Goals of Inference
• Population totals, means
• Estimates of change (e.g., ratios)
• Hypothesis testing of model parameters
  – This will be our focus today

Inferences about Models
• Ex.: Risk factor for disease
• Infer to sample? Frame? “Population”? “Population of interest”?
• Design-based vs. Model-based
• Combine both

Three Concepts that Affect Inference
• Target of inference: frame, today’s population, “all populations”
• Variance estimation
• Accuracy of model

Target of Inference
• Finite population: Frame; population on which frame was based.
  – Weighted likelihood inference yields design-consistent estimators
  – Is the frame an unbiased representation of the population? If so, the estimator is still unbiased
• Superpopulation: Infinite set of such populations
  – Concept created for model-based approaches

Finite or Infinite?
In health research, the aim is often the superpopulation.

Estimating Parameters: Pseudo-ML
• Include model in design-based estimates

Pseudo-ML, con.

Ex: Gestational Age and Birth Weight
• The 1988 National Maternal and Infant Health Survey obtained data on birth weight and gestational age from a US probability sample of babies.
• Oversample of African-American and low-birth-weight babies.

Gestational Age and Birth Weight, con.
• Regress gestational age (weeks) on birth weight:
• A 100-g reduction in birth weight is associated with a 1.5-day decrease in gestational age in the weighted analysis, but a 2.7-day decrease in gestational age in the unweighted analysis.

Gestational Age and Birth Weight, con.
• This is a function of non-linearity in the birth weight-gestational age association and the oversampling of low-birth-weight babies:
[Figure panels: Unweighted, Weighted]

Linear Regression for Survey Data

Estimating Variances for Linear Regression Estimators
• Asymptotically unbiased for frame
• Taylor Linearization
• Replication Methods
  – Idea: Replicating the sample design. Compute replicate weights.
  – Jackknife; Balanced Half-Sample Replication; Bootstrap
• Two examples follow

Taylor Linearization

Taylor Linearization for Linear Regression

Replication-Based Methods
An alternative to linearization is replication or resampling: elements of the sample are dropped, a new estimate is computed using the remaining elements of the sample, and the estimates resulting from repeated applications of this process are used to compute a variance estimator.

Jackknife
• When clustering, stratification present:
  – Drop clusters rather than individual elements
  – Accounts for the fact that resampling is within strata

Confidence Intervals and Hypothesis Tests for Linear Regression Parameters

Example with NHANES
• Individuals’ sample in NHANES I.
• Systolic blood pressure regressed on
  – size of place-of-residence (urban ≥1 million, urban <1 million & rural), age, body mass index and sex for individuals 25+ yrs.
• Sample design approximated by 35 strata with 3 sampled PSUs each, degrees of freedom d = 70.
• Place-of-residence: three categories, so two dummy variables (q = 2). Test of significance: multiply the Wald statistic by 69/140 and compare to F(2, 69); p = 0.064.

Confidence Intervals and Hypothesis Tests for Linear Regression Parameters
• Preceding assumes …
• Remember, in practice we have an approximation
• Small df?
  – Use Satterthwaite adj. (available in SUDAAN)
  – Ignore stratification
  – Ignore clustering

Diagnostics
• Partial residual plots.
  – Determining functional relationship between independent variables and outcome.
• Added variable plots.
  – Detecting influential points.

Partial Residual Plot

Partial Residual Plot, con.
• Example: NHANES II
• Linear regression of systolic blood pressure on log of blood lead, age, BMI for men 40-59.
• Partial residual plot for log(lead)
• The local linear smoother does not deviate from linearity, so a single linear term in log(lead) is sufficient.

Added Variable Plot

Added Variable Plot, con.
• Example: NHANES I
• Linear regression of systolic blood pressure on age, BMI and dietary sodium for women 40-49
• Added variable plot for dietary sodium

Added Variable Plot
• Areas of bubbles are proportional to weights.
• Dotted line is the weighted least-squares line and has slope 3.24, the sodium coefficient.
• Pt. A is in a position to affect the slope of the line, but it has a small weight, so it is not influential.
• With pt. A removed, the slope is 3.41.
• If pt. A had the weight of pt. B, it would become highly influential and the slope would be 0.34.
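
Illustration: weighted regression with a stratified jackknife
The point-estimation and jackknife slides above can be illustrated with a short Python sketch. It is not part of the original presentation. It assumes synthetic data (10 strata, 2 PSUs per stratum, made-up weights), and the helper names weighted_ls and jackknife_cov are hypothetical. The point estimate is the weighted least-squares solution, and the variance estimate drops one PSU at a time within each stratum, rescaling the weights of the remaining PSUs in that stratum. Real analyses would normally rely on survey software such as SUDAAN.

```python
import numpy as np

def weighted_ls(X, y, w):
    """Weighted least-squares coefficients: solve (X'WX) b = X'Wy."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

def jackknife_cov(X, y, w, strata, psu):
    """Stratified delete-one-PSU jackknife covariance of the WLS coefficients.

    Assumes every stratum contains at least two PSUs.
    """
    beta = weighted_ls(X, y, w)
    cov = np.zeros((X.shape[1], X.shape[1]))
    for h in np.unique(strata):
        in_h = strata == h
        psus = np.unique(psu[in_h])
        n_h = len(psus)
        for j in psus:
            drop = in_h & (psu == j)
            w_rep = w.copy()
            w_rep[drop] = 0.0                           # drop the whole cluster
            w_rep[in_h & ~drop] *= n_h / (n_h - 1)      # reweight within stratum h
            d = weighted_ls(X, y, w_rep) - beta
            cov += (n_h - 1) / n_h * np.outer(d, d)
    return beta, cov

# Synthetic design (an assumption for illustration only)
rng = np.random.default_rng(0)
n_strata, psus_per_stratum, n_per_psu = 10, 2, 30
strata = np.repeat(np.arange(n_strata), psus_per_stratum * n_per_psu)
psu = np.tile(np.repeat(np.arange(psus_per_stratum), n_per_psu), n_strata)
n = strata.size

x = rng.normal(size=n)
psu_effect = rng.normal(scale=0.5, size=(n_strata, psus_per_stratum))[strata, psu]
y = 1.0 + 2.0 * x + psu_effect + rng.normal(size=n)
w = rng.uniform(50.0, 500.0, size=n)                    # made-up sample weights
X = np.column_stack([np.ones(n), x])

beta, cov = jackknife_cov(X, y, w, strata, psu)
se = np.sqrt(np.diag(cov))
df = n_strata * psus_per_stratum - n_strata             # number of PSUs minus number of strata
print("coefficients:", beta)
print("jackknife SEs:", se)
print("design degrees of freedom:", df)
```

Setting a dropped PSU's weights to zero and inflating the rest of its stratum by n_h/(n_h - 1) is one common way to form replicate weights. Other conventions (for example, centering on the mean of the replicates) give slightly different numbers.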