Advanced Analysis of Complex Survey Data Part I Julia L. Bienias

Advanced Analysis of Complex
Survey Data
Part I
Julia L. Bienias
Presented at ARM 2009
Overview
• Motivation and Example
• Point Estimation
– Design vs. Model-Based
– Pseudo maximum-likelihood
• Variance Estimation
– Taylor Series/Linearization
– Jackknife
• Model Fit and Model Checking
– Partial residual plots
– Added variable plots
J. Bienias & M. Elliott
ARM 2009: Advanced Analysis of Complex
Survey Data
2
Why Health Surveys?
• Representative information at reasonable cost
• Some Examples
− Blood Lead in Children -- 37% drop in high levels in
1976-80 Mahaffey et al. (1982) NEJM
− Children Growth Curves -- 12 million charts
Hamill et al. (1979) AJCN
− Rise in Cesarean Section Rate 1970-78 -- repeat C-section not necessary
Placek & Taffel (1980) Public Hlth Rep
− The Community Intervention Trial for Smoking
Cessation COMMIT Research Group (1995) AJPH
Example: NHANES
• Large-scale health survey: the Third National Health and Nutrition
Examination Survey (NHANES III)
• $100 million household/medical examination survey
• ~40,000 individuals surveyed 1988-1994
• Objectives:
− (1) Estimate national prevalence of diseases/risk factors
− (2) National population reference distributions of health
measures
− (3) Secular trends in diseases/risk factors
− (4) Disease etiology
− (5) Natural history of diseases
Example: NHANES, con.
• Multistage Stratified Cluster sample design
• 2812 Primary Sampling Units (PSUs): 13 certainty PSUs &
2799 PSUs divided into 34 strata
• 1st stage: 2 PSUs sampled from each of the 34 strata (68 PSUs), plus the 13 certainty
PSUs
• Counties oversampled if highly populated, large % African
American or large % Mexican-American.
• 2nd stage: City/Suburban Blocks or Contiguous Rural Areas
(called Segments) Randomly Sampled
• Blocks/Areas with large minority pop. oversampled.
• 3rd stage: Households Sampled
• Rate depended on racial and ethnic makeup.
• 4th stage: Individuals Sampled
• Rate depended on sex, age, race/ethnicity.
Sample Weights
• Sampling individuals at different rates means
sampled individuals represent different numbers
of persons in population.
• For a surveyed person, his/her sample weight w_i
estimates the number of persons he/she represents,
e.g., w_i = 12,302
Sample Weights, con.
• Basic weight is equal to the reciprocal of the
probability of selection π_i: w_i = 1/π_i.
• Adjustments for differential nonresponse and
undercoverage are often also part of w_i.
• Public-use data files contain w_i and codes for
PSUs and strata.
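
As a minimal sketch of how the basic weights are used, the block below builds w_i = 1/π_i and forms a weighted mean. The selection probabilities and outcome values are made-up illustrative numbers, not NHANES data, and nonresponse or coverage adjustments are ignored.

```python
# Hypothetical selection probabilities for four sampled persons (illustrative only).
selection_probs = [0.001, 0.0005, 0.002, 0.001]
# Hypothetical outcome values, e.g. systolic blood pressure in mm Hg.
y = [120.0, 135.0, 110.0, 128.0]

# Basic weight: reciprocal of the selection probability, w_i = 1 / pi_i.
weights = [1.0 / p for p in selection_probs]

# The sum of the weights estimates the size of the population represented.
n_hat = sum(weights)

# Weighted estimate of the population mean versus the plain sample mean.
weighted_mean = sum(w * yi for w, yi in zip(weights, y)) / n_hat
unweighted_mean = sum(y) / len(y)

print(n_hat, weighted_mean, unweighted_mean)
```

The person sampled with the smallest probability carries the largest weight, so the weighted mean is pulled toward that person's value.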
Goals of Inference
• Population totals, means
• Estimates of change (e.g., ratios)
• Hypothesis testing of model parameters
– This will be our focus today
Inferences about Models
• Ex.: Risk factor for disease
• Infer to sample? Frame? “Population”?
“Population of interest”?
• Design-based vs. Model-based
• Combine both
Three Concepts that Affect Inference
• Target of inference: frame, today’s
population, “all populations”
• Variance Estimation
• Accuracy of model
Target of Inference
• Finite population: Frame; population on
which frame was based.
– Weighted likelihood inference yields design-consistent estimators
– Is frame an unbiased representation of
population? If so, estimator still unbiased
• Superpopulation: Infinite set of such
populations
– Concept created for model-based approaches
Finite or Infinite?
In health research, aim is often superpopulation
Estimating Parameters: Pseudo-ML
• Include model in design-based estimates
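
The formulas on the pseudo-ML slides did not survive extraction. For orientation, the usual textbook form of the pseudo-maximum-likelihood estimator is sketched here; the density f and the notation are assumptions, not taken from the slides. Each sampled unit's log-likelihood contribution is weighted by its sample weight w_i, so the weighted sample sum estimates the log-likelihood of the full finite population.

```latex
\hat{\theta}_{\mathrm{PML}}
  = \arg\max_{\theta} \sum_{i \in s} w_i \, \log f(y_i \mid x_i ; \theta),
\qquad \text{equivalently} \qquad
\sum_{i \in s} w_i \, \frac{\partial}{\partial \theta} \log f(y_i \mid x_i ; \theta) = 0 .
```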
Pseudo-ML, con.
Pseudo-ML, con.
Ex: Gestational Age and Birth Weight
• The 1988 National Maternal and Infant Health
Survey obtained data on birth weight and
gestational age from a US probability sample
of babies.
• Oversample of African-American and low-birth
weight babies.
Gestational Age and Birth Weight, con.
• Regress age (weeks) on birth weight:
• A 100-g reduction in birth weight is
associated with a 1.5-day decrease in
gestational age in the weighted analysis, but a 2.7-day
decrease in gestational age in the
unweighted analysis.
Gestational Age and Birth Weight, con.
• This is a function of non-linearity in the birth weight-gestational age association and the oversampling of low birth
weight babies:
[Figure: two panels, Unweighted and Weighted]
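
The mechanism described above can be reproduced with a toy data set. This is a sketch with made-up numbers (not the survey's data): gestational age rises steeply with birth weight in the low range and slowly in the normal range, and low birth weight babies are oversampled, so they carry small weights. The helper name wls_slope is ours.

```python
def wls_slope(x, y, w):
    """Slope of the weighted least-squares line of y on x (with intercept)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx

# Made-up data: birth weight in units of 100 g, gestational age in weeks.
birth_weight = [10, 15, 20, 25, 30, 35, 40]
gest_age = [29.5, 32.5, 35.5, 38.5, 39.0, 39.5, 40.0]  # steep below 2500 g, flat above

# Oversampled low birth weight babies get small weights (hypothetical values).
weights = [1, 1, 1, 20, 20, 20, 20]

unweighted_slope = wls_slope(birth_weight, gest_age, [1] * len(birth_weight))
weighted_slope = wls_slope(birth_weight, gest_age, weights)

print(unweighted_slope, weighted_slope)  # weeks of gestational age per 100 g
```

The unweighted fit is dominated by the steep low-weight region, while the weighted fit reflects the flatter relationship that holds for most of the population.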
Linear Regression for Survey Data
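
The slide's formula is not preserved. A standard form of the weighted (pseudo-ML) least-squares estimator, given here as an assumed reconstruction with x_i the covariate vector, y_i the outcome and w_i the sample weight, is:

```latex
\hat{\beta}_w
  = \Bigl( \sum_{i \in s} w_i \, x_i x_i^{\top} \Bigr)^{-1} \sum_{i \in s} w_i \, x_i y_i
  = (X^{\top} W X)^{-1} X^{\top} W y,
\qquad W = \mathrm{diag}(w_1, \dots, w_n).
```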
Estimating Variances for Linear
Regression Estimators
• Asymptotically unbiased for frame
• Taylor Linearization
• Replication Methods
– Idea: Replicating the sample design. Compute
replicate weights.
– Jackknife; Balanced Half-Sample Replication;
Bootstrap
• Two examples follow
Taylor Linearization
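
The slide content is missing here. As a hedged reminder of the general idea (notation ours): when an estimator is a smooth function g of estimated totals, a first-order Taylor expansion replaces it by a linear combination of those totals, whose design variance can be estimated with the usual formulas for a weighted total.

```latex
\hat{\theta} = g(\hat{Y}_1, \dots, \hat{Y}_p)
  \approx \theta + \sum_{j=1}^{p} \frac{\partial g}{\partial Y_j} \, (\hat{Y}_j - Y_j),
\qquad
\operatorname{Var}(\hat{\theta}) \approx
  \operatorname{Var}\Bigl( \sum_{i \in s} w_i z_i \Bigr),
\quad
z_i = \sum_{j=1}^{p} \frac{\partial g}{\partial Y_j} \, y_{ij} .
```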
Taylor Linearization for Linear
Regression
Taylor Linearization for Linear
Regression
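
Since the slide formulas are lost, the following is only a sketch of one common linearization ("sandwich") variance estimator for a weighted regression with one covariate. It assumes PSUs are treated as sampled with replacement within strata, no finite population correction, and at least two PSUs per stratum. The function names and the tiny data set are hypothetical.

```python
def wls_fit(x, y, w):
    """Weighted least-squares (intercept, slope) and the inverse of X'WX."""
    a00 = sum(w)
    a01 = sum(wi * xi for wi, xi in zip(w, x))
    a11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    b0 = sum(wi * yi for wi, yi in zip(w, y))
    b1 = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = a00 * a11 - a01 * a01
    a_inv = [[a11 / det, -a01 / det], [-a01 / det, a00 / det]]
    beta = [a_inv[0][0] * b0 + a_inv[0][1] * b1,
            a_inv[1][0] * b0 + a_inv[1][1] * b1]
    return beta, a_inv

def matmul2(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

def linearization_variance(x, y, w, stratum, psu):
    """Return beta and its linearization covariance (X'WX)^-1 B (X'WX)^-1."""
    beta, a_inv = wls_fit(x, y, w)
    # Sum the weighted scores w_i * x_i * e_i within each PSU.
    psu_totals = {}
    for xi, yi, wi, h, c in zip(x, y, w, stratum, psu):
        resid = yi - beta[0] - beta[1] * xi
        z = psu_totals.setdefault((h, c), [0.0, 0.0])
        z[0] += wi * resid
        z[1] += wi * xi * resid
    by_stratum = {}
    for (h, _), z in psu_totals.items():
        by_stratum.setdefault(h, []).append(z)
    # B: between-PSU variation of the score totals within each stratum.
    middle = [[0.0, 0.0], [0.0, 0.0]]
    for zs in by_stratum.values():
        n_h = len(zs)  # number of sampled PSUs in the stratum (must be >= 2)
        mean = [sum(z[k] for z in zs) / n_h for k in range(2)]
        for z in zs:
            d = [z[0] - mean[0], z[1] - mean[1]]
            for r in range(2):
                for c in range(2):
                    middle[r][c] += n_h / (n_h - 1) * d[r] * d[c]
    return beta, matmul2(matmul2(a_inv, middle), a_inv)

# Hypothetical data: 2 strata, 2 PSUs per stratum, 2 persons per PSU.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.8, 8.2]
w = [10, 10, 20, 20, 15, 15, 30, 30]
stratum = [1, 1, 1, 1, 2, 2, 2, 2]
psu = [1, 1, 2, 2, 1, 1, 2, 2]

beta, cov = linearization_variance(x, y, w, stratum, psu)
print(beta, cov[1][1] ** 0.5)  # coefficients and standard error of the slope
```

The variance comes from how much the PSU-level score totals vary within strata, not from individual residuals, which is how clustering and stratification enter the estimate.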
Replication-Based Methods
An alternative to linearization is replication or
resampling: elements of the sample are
dropped, a new estimate is computed using
the remaining elements of the sample, and
the estimates resulting from
repeated applications of this process are used
to compute a variance estimator.
Jackknife
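
The formula on this slide is not preserved. For a simple sample of n units, the basic delete-one jackknife variance estimator is usually written as follows (an assumed reconstruction), where \hat{\theta}_{(i)} is the estimate recomputed with unit i dropped:

```latex
v_{J}(\hat{\theta}) = \frac{n-1}{n} \sum_{i=1}^{n} \bigl( \hat{\theta}_{(i)} - \hat{\theta} \bigr)^{2} .
```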
Jackknife
• When clustering, stratification present:
- Drop clusters rather than individual elements
- Accounts for the fact that resampling is within strata
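
A minimal sketch of a stratified delete-one-PSU jackknife follows. It assumes the common variant in which dropping a PSU sets its weights to zero and inflates the weights of the other PSUs in the same stratum by n_h/(n_h - 1). The function names and data are hypothetical.

```python
def jackknife_variance(estimator, weights, stratum, psu):
    """Delete-one-PSU jackknife; estimator maps a list of weights to a number."""
    theta_full = estimator(weights)
    psus_in_stratum = {}
    for h, c in zip(stratum, psu):
        psus_in_stratum.setdefault(h, set()).add(c)
    variance = 0.0
    for h, ids in psus_in_stratum.items():
        n_h = len(ids)  # sampled PSUs in stratum h (must be >= 2)
        for dropped in ids:
            rep_weights = []
            for wi, hi, ci in zip(weights, stratum, psu):
                if hi == h and ci == dropped:
                    rep_weights.append(0.0)                     # dropped PSU
                elif hi == h:
                    rep_weights.append(wi * n_h / (n_h - 1))    # same stratum
                else:
                    rep_weights.append(wi)                      # other strata
            theta_rep = estimator(rep_weights)
            variance += (n_h - 1) / n_h * (theta_rep - theta_full) ** 2
    return theta_full, variance

def make_weighted_mean(y):
    """Build an estimator function: weighted mean of y for given weights."""
    def weighted_mean(w):
        return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return weighted_mean

# Hypothetical data: 2 strata, 2 PSUs per stratum, 2 persons per PSU.
y = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.8, 8.2]
w = [10, 10, 20, 20, 15, 15, 30, 30]
stratum = [1, 1, 1, 1, 2, 2, 2, 2]
psu = [1, 1, 2, 2, 1, 1, 2, 2]

mean_hat, var_hat = jackknife_variance(make_weighted_mean(y), w, stratum, psu)
print(mean_hat, var_hat ** 0.5)
```

Because the estimator is passed in as a function of the weights, the same routine could be applied to a regression coefficient by supplying a function that refits the weighted regression.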
Confidence Intervals and Hypothesis
Tests for Linear Regression Parameters
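
The formulas on this slide are missing. A commonly used form, given here as an assumption, takes d = (number of sampled PSUs) - (number of strata) as the design degrees of freedom, uses a t reference distribution for a single coefficient, and uses an adjusted Wald F statistic for q coefficients jointly:

```latex
\hat{\beta}_j \pm t_{d,\,1-\alpha/2} \, \sqrt{\hat{V}(\hat{\beta}_j)},
\qquad
W = \hat{\beta}_{(q)}^{\top} \, \hat{V}_{(q)}^{-1} \, \hat{\beta}_{(q)},
\qquad
F = \frac{d - q + 1}{d\,q} \, W \;\sim\; F(q,\; d - q + 1) \ \text{(approximately)} .
```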
Example with NHANES
• Individuals’ Sample in NHANES I.
• Systolic Blood Pressure regressed on
– size of place-of-residence (urban ≥1 million, urban <1 million
& rural), age, body mass index and sex for individuals 25+ yrs.
• Sample design approximated by 35 strata with 3
sampled PSUs each; degrees of freedom d = 105 PSUs − 35 strata = 70.
• Place-of-Residence: three categories, so two dummy
variables (q=2). The test of significance uses the Wald statistic
times (d − q + 1)/(dq) = 69/140, compared to F(2,69); p = 0.064.
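
The multiplier 69/140 and the F(2,69) reference distribution follow from d = 70 and q = 2 if the adjustment is the usual factor (d - q + 1)/(dq). That reading is an assumption, but it matches the numbers quoted above. A short check, with a helper name of our own:

```python
from fractions import Fraction

def adjusted_wald_factor(d, q):
    """Return the multiplier (d - q + 1) / (d * q) and the F degrees of freedom."""
    return Fraction(d - q + 1, d * q), (q, d - q + 1)

n_strata, psus_per_stratum = 35, 3
d = n_strata * psus_per_stratum - n_strata  # 105 PSUs - 35 strata = 70
q = 2                                       # two dummy variables

factor, f_df = adjusted_wald_factor(d, q)
print(factor, f_df)
```

Given a Wald statistic W, the p-value would then come from the upper tail of the F distribution evaluated at factor * W, for example with scipy.stats.f.sf if SciPy is available.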
Confidence Intervals and Hypothesis
Tests for Linear Regression Parameters
• Preceding assumes [expression not recovered from the slide]. Remember, in practice we
have only an approximation
• Small df ?
– Use Satterthwaite adj. (available in SUDAAN)
– Ignore stratification
– Ignore clustering
Diagnostics
• Partial residual plots.
– Determining functional relationship between
independent variables and outcome.
• Added variable plots.
– Detecting influential points.
Partial Residual Plot
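
The definition on this slide is not preserved. In the usual formulation (assumed here, notation ours), the partial residual for predictor j adds that predictor's fitted linear contribution back to the residual from the full weighted fit, and is plotted against x_{ij}. Curvature in the plot suggests that a linear term in x_j is not enough.

```latex
r_i^{(j)} = \hat{e}_i + \hat{\beta}_j \, x_{ij},
\qquad
\hat{e}_i = y_i - x_i^{\top} \hat{\beta}_w .
```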
Partial Residual Plot, con.
• Example: NHANES II
• Linear regression of systolic blood pressure on log
of blood lead, age, BMI for men 40-59.
• Partial residual plot for log(lead)
• Local linear smoother does not deviate from linearity, so a single
linear term in log(lead) is sufficient.
Added Variable Plot
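
The construction on this slide is missing. A standard version, stated here as an assumption, plots two sets of residuals against each other: the outcome and the predictor of interest x_j are each regressed, with the sample weights, on all the other predictors x_{-j}. The weighted least-squares slope of the first set of residuals on the second equals the coefficient of x_j in the full weighted regression.

```latex
\hat{e}^{\,y}_i = y_i - x_{-j,i}^{\top} \hat{\gamma}_w,
\qquad
\hat{e}^{\,x_j}_i = x_{ij} - x_{-j,i}^{\top} \hat{\delta}_w,
\qquad
\frac{\sum_{i \in s} w_i \, \hat{e}^{\,x_j}_i \, \hat{e}^{\,y}_i}
     {\sum_{i \in s} w_i \, \bigl( \hat{e}^{\,x_j}_i \bigr)^{2}} = \hat{\beta}_j .
```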
Added Variable Plot, con.
• Example: NHANES I
• Linear regression of systolic blood pressure on
age, BMI and dietary sodium for women 40-49
• Added variable plot for dietary sodium
Added Variable Plot
• Areas of bubbles are
proportional to weights.
• Dotted line is the weighted
least-squares line and has
slope 3.24, the sodium
coefficient.
• Pt. A is in an influential
position to affect the slope
of the line, but it has a small
weight, so it is not influential.
• With pt. A removed, the slope is 3.41.
• If pt. A had the weight of pt.
B, it would become highly
influential and the slope would be 0.34.
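
The interplay between position and weight can be illustrated with made-up numbers (not the NHANES I data): one point sits far from the others in x, and its effect on the weighted slope depends on the weight it carries. The helper name wls_slope is ours.

```python
def wls_slope(x, y, w):
    """Slope of the weighted least-squares line of y on x (with intercept)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx

# Made-up data: the last point plays the role of "pt. A" (extreme x, off the trend).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [1.1, 2.0, 2.9, 4.2, 5.0, 2.0]

small_weight = [100, 100, 100, 100, 100, 1]    # A carries a small weight
large_weight = [100, 100, 100, 100, 100, 100]  # A weighted like the other points

slope_a_small = wls_slope(x, y, small_weight)
slope_without_a = wls_slope(x[:-1], y[:-1], small_weight[:-1])
slope_a_large = wls_slope(x, y, large_weight)

print(slope_a_small, slope_without_a, slope_a_large)
```

With a small weight, removing the outlying point changes the slope only slightly. Giving it a weight comparable to the other points flattens the line substantially, the same pattern as the change from 3.24 to 0.34 described above.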