Using Stata for Subpopulation
Analysis of Complex Sample
Survey Data
Brady T. West
PhD Student
Michigan Program in Survey Methodology
July 30, 2009
2009 Stata Conference
Presentation Outline
1. Introduction: Subclass Analysis Issues
2. Kish’s Taxonomy of Subclasses
3. Two Alternative Approaches to Inference
4. Variance Estimation and Methods for ‘Singletons’
5. Examples using NHANES and NHAMCS Data
6. Suggestions for Practice
7. Directions for Future Research
2009 Stata Conference: Subpop
Analysis of Survey Data
2
Subclass Analysis Issues
• Analysts of large, complex sample survey
data sets are often interested in making
inferences about subpopulations of the
original population that the sample was
selected from (e.g., Caucasian Females)
• These subpopulations are referred to
interchangeably in various literatures as
subgroups, subclasses, subpopulations,
domains, and subdomains, leading to
confusion among analysts of survey data
Subclass Analysis Issues, cont’d
• Software procedures for analysis of
complex sample survey data are becoming
more powerful, flexible, and widely
available, offering analysts several options
• Analysts need to be careful when analyzing
subclasses, and be aware of the alternative
approaches to subclass analysis that are
possible and their implications for inference
Kish’s Taxonomy of Subclasses
• Design Domains: Restricted to specific strata
according to the complex sample design (usually
geographically, e.g., Texas)
• Cross-Classes: Broadly distributed (in theory)
across the strata and primary sampling units
defining a complex sample (e.g., African-Americans
over age 50)
• Mixed Classes: Disproportionately distributed
across the complex sample design (e.g., Hispanics
in a sample including Los Angeles as a stratum)
• See Kish (1987), Statistical Design for Research
Design Domains
X = Sample Element in Subclass
[Figure: grid of design strata 1-5 (rows) by PSU 1 and PSU 2 (columns). X marks appear only in strata 1 and 2; strata 3-5 contain none.]
Cross-Classes
[Figure: grid of design strata 1-5 (rows) by PSU 1 and PSU 2 (columns). X marks (sample elements in the subclass) appear throughout the grid, in varying numbers per cell.]
Mixed Classes
[Figure: grid of design strata 1-5 (rows) by PSU 1 and PSU 2 (columns). X marks (sample elements in the subclass) are unevenly distributed: some cells are densely filled and others contain only one or two.]
Applying Kish’s Taxonomy
• The type of subclass is critical for
determining an appropriate analysis
approach
• Two possible approaches to inference
motivated by the taxonomy:
1. Unconditional approach (cross-classes,
mixed classes)
2. Conditional approach (design domains)
The Unconditional Approach
• Appropriate for Cross-Classes, and in some
cases Mixed Classes; the subclass of interest
theoretically can appear in all design strata
and primary sampling units (PSUs)
• KEY POINT: Allow the software to process
the entire survey data set, and recognize all
possible design strata and PSUs; DO NOT
delete sample cases not in the subclass!
The Unconditional Approach
• Rationale: estimated variances for sample
estimates of subclass parameters (based on
within-stratum variance between PSUs)
need to reflect sample-to-sample variability
based on the full complex design
• In other words, if a particular subclass does
not appear in a PSU in any given sample
(although in theory it could have), that PSU
should contribute 0 to variance estimates,
rather than be ignored completely!
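The zero-contribution argument can be written out with the standard with-replacement (ultimate cluster) linearization estimator for a weighted subclass total. The notation is an assumption introduced here for illustration and is not from the slides: h = 1, ..., H indexes strata, α = 1, ..., a_h indexes PSUs within stratum h, w is the sampling weight, and δ equals 1 for subclass members and 0 otherwise.

```latex
\[
\hat{Y}_d = \sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} z_{h\alpha},
\qquad
z_{h\alpha} = \sum_{i \in (h,\alpha)} w_{h\alpha i}\, \delta_{h\alpha i}\, y_{h\alpha i}
\]
\[
\widehat{\operatorname{var}}(\hat{Y}_d)
  = \sum_{h=1}^{H} \frac{a_h}{a_h - 1}
    \sum_{\alpha=1}^{a_h} \left( z_{h\alpha} - \bar{z}_h \right)^2,
\qquad
\bar{z}_h = \frac{1}{a_h} \sum_{\alpha=1}^{a_h} z_{h\alpha}
\]
```

A PSU with no sampled subclass members has z = 0 but still counts in a_h and in the stratum mean. Deleting it changes both, and if the deletion leaves a_h = 1 the stratum term is undefined.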
The Unconditional Approach
• Further, the subclass sample size in each
stratum is going to be a random variable,
and theoretical sample-to-sample variance
in realizations of this random variable
should be incorporated into any variance
estimation procedures
The Unconditional Approach
• If cross-classes (or in some cases mixed classes)
are being analyzed, and PSUs where the subclass
does not appear (by random chance) are deleted,
problems arise
• Some strata may appear to have only one PSU by
design (preventing variance estimation unless an
ad hoc approach is used)
• Entire design strata may be dropped, impacting
variance estimates and calculations of degrees of
freedom
The Unconditional Approach:
General Stata Code
• svy, subpop(indicator): command varlist, options
• indicator = an indicator variable for the subpop or
an if condition, e.g., if male == 1
• svy: mean varlist, over(groupvar)
• svy: proportion varlist, over(groupvar)
• Stata drops strata* with no subpopulation
observations from degrees of freedom calculations
* Exercise: repeat 10 times really fast
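A minimal sketch of the unconditional syntax on a toy data set. The data values and the variable names (stratum, psu, w, y, male) are hypothetical and chosen only so that one PSU contains no subpopulation members.

```stata
* Hypothetical toy data: 2 strata x 2 PSUs x 2 elements
* (stratum 2, PSU 1 happens to contain no males)
clear
input stratum psu w y male
1 1 10 5 1
1 1 10 7 0
1 2 10 6 1
1 2 10 8 0
2 1 20 4 0
2 1 20 9 0
2 2 20 3 1
2 2 20 5 1
end

* Declare the full design once; no cases are deleted
svyset psu [pweight = w], strata(stratum)

* Estimates for all groups at once with over()
svy: mean y, over(male)

* Unconditional analysis using an indicator variable
svy, subpop(male): mean y

* subpop() also accepts an if condition
svy, subpop(if male == 1): mean y
display "Subpop obs = " e(N_sub) ", strata = " e(N_strata) ", PSUs = " e(N_psu)
```

All four PSUs remain in the design even though only four of the eight cases are male, so the empty PSU still enters the variance calculation.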
The Conditional Approach
• Appropriate for Design Domains, where a
subclass cannot appear outside of specific
design strata
• The rationale behind the unconditional
approach no longer applies
• Certain design strata should not contribute
to variance estimation or calculation of
degrees of freedom
The Conditional Approach
• Restrict the analysis to only those design
strata where the subclass of interest exists
• Variance estimates reflecting sample-to-sample
variability should only be based on
those design strata where the subclass can
appear (unlike the unconditional approach)
• Subclass sample sizes in design domains are
assumed to be fixed, by design
The Conditional Approach:
General Stata Code
• svy: command varlist if (condition), options
• (condition) might be male == 1, or a more
complex combination of conditions (e.g.,
male == 1 & age >= 50 & age <= 90)
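A minimal sketch of the conditional syntax on a toy data set. The data values and variable names (stratum, psu, w, y, male, age) are hypothetical. The data are built so that restricting with if leaves stratum 2 with a single PSU, which illustrates what the software sees when cases outside the subclass are excluded.

```stata
* Hypothetical toy data: 2 strata x 2 PSUs x 2 elements
clear
input stratum psu w y male age
1 1 10 5 1 55
1 1 10 7 0 43
1 2 10 6 1 62
1 2 10 8 0 58
2 1 20 4 0 66
2 1 20 9 0 39
2 2 20 3 1 48
2 2 20 5 1 71
end

svyset psu [pweight = w], strata(stratum)

* Conditional analysis: only cases meeting the if condition are used,
* so stratum 2, PSU 1 (no males) is not seen by the variance estimator
svy: mean y if male == 1
display "PSUs used = " e(N_psu) ", singleton strata present = " e(singleton)

* A more complex combination of conditions
svy: mean y if male == 1 & age >= 50 & age <= 90
display "Obs used = " e(N) ", PSUs used = " e(N_psu)
```

With the default singleunit(missing) setting, Stata reports a missing standard error here because stratum 2 appears to contain only one PSU.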
Variance Estimation Methods
• All of these issues are only relevant when
using Taylor Series Linearization, which is
the default variance estimation method in Stata
• Conditional analyses are OK to perform
when using replication methods, such as
Balanced Repeated Replication or Jackknife
Repeated Replication (Rust and Rao, 1996)
Ad-hoc Fixes for ‘Singleton’
Clusters in Stata 10.1
• Stata 10.1 provides users with four ad-hoc fixes
for the problem where strata are identified with
only a single ultimate cluster for variance
estimation in a subpopulation analysis:
1. Report Missing Standard Errors (not really a fix)
2. Treat Units as Certainty Units, which contribute
nothing to the standard error
3. Scale Variance using Certainty Units, which uses the
average variance from each stratum with multiple
PSUs for each stratum with only a single PSU
4. Center at the Grand Mean, where the variance
contribution comes from a deviation from the grand
mean instead of the stratum mean
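The four fixes correspond to the singleunit() option of svyset. The sketch below loops over them on a hypothetical toy data set (data values and variable names are assumptions) in which a conditional analysis leaves one stratum with a single PSU.

```stata
* Hypothetical toy data: stratum 2, PSU 1 contains no males
clear
input stratum psu w y male
1 1 10 5 1
1 1 10 7 0
1 2 10 6 1
1 2 10 8 0
2 1 20 4 0
2 1 20 9 0
2 2 20 3 1
2 2 20 5 1
end

* Re-declare the design under each singleunit() setting and
* rerun the same conditional analysis
foreach opt in missing certainty scaled centered {
    display _newline "singleunit(`opt')"
    svyset psu [pweight = w], strata(stratum) singleunit(`opt')
    svy: mean y if male == 1
}
```

The point estimate is the same in every pass; only the reported standard error changes with the setting.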
Example: The NHANES Data
• We first consider examples based on the
NHANES II data set, collected from a
nationally representative multistage
probability sample of the U.S. population
from 1976-1980 (oldie but a goodie)
• Briefly, a sample of the U.S. population was
given medical examinations in an effort to
assess the health of the U.S. population
Example NHANES Analysis
• Analysis Subclass: African-Americans ages
50 and above (this is a cross-class of the
U.S. population, which can theoretically
appear in all design strata and PSUs)
• Analysis Objective: Estimate the mean
systolic blood pressure of this subclass and
an appropriate standard error
• See West et al. (2007) for more details
Conditional Approach:
Stata Code for NHANES Analysis
• svyset ppsu [pweight = fwgtexam], strata(stratum) singleunit(missing)
• svyset ppsu [pweight = fwgtexam], strata(stratum) singleunit(centered)
• Also singleunit(certainty), singleunit(scaled)
• gen b50subp = (race == 2 & ager >= 50)
• svy: mean bpsyst if b50subp == 1
Conditional Approach: Results
Method       Est. Mean   TSL SE   Design DF
Missing SE   144.09      .        50-29 = 21
Centered     144.09      1.66     50-29 = 21
Certainty    144.09      1.62     50-29 = 21
Scaled       144.09      1.90     50-29 = 21
Conditional Approach?
• This approach would not be appropriate for
this particular subclass
• Computed standard errors would generally
be biased downward, because additional
sources of sample-to-sample variability are
ignored when following this approach
• Same issues apply for analytic models
• Evidence that the “scaled” ad-hoc fix may
be overly conservative!
Unconditional Approach:
Stata Code for NHANES Analysis
• svyset ppsu [pweight = fwgtexam], strata(stratum) singleunit(missing)
• Note: choice of single unit option does not
matter when following this approach!
• gen b50subp = (race == 2 & ager >= 50)
• svy, subpop(b50subp): mean bpsyst
Unconditional Approach: Results
Method       Est. Mean   TSL SE   Des. DF*
Missing SE   144.09      1.66     58-29 = 29
Centered     144.09      1.66     58-29 = 29
Certainty    144.09      1.66     58-29 = 29
Scaled       144.09      1.66     58-29 = 29
* Note: Stata dropped three strata with no sample units in the subpopulation.
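The design degrees of freedom reported in the conditional and unconditional NHANES results follow the usual rule of PSUs minus strata. Written out, with notation assumed here for illustration (a_h is the number of PSUs the software counts in stratum h, and H is the number of strata it retains):

```latex
\[
\mathrm{df} = \sum_{h=1}^{H} a_h - H
\]
\[
\text{Conditional: } 50 - 29 = 21
\qquad
\text{Unconditional: } 58 - 29 = 29
\]
```

Both analyses count 29 strata, but the conditional analysis counts 8 fewer PSUs because PSUs with no sampled subclass members are excluded by the if restriction.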
Unconditional Approach?
• This approach would be the appropriate
choice for a cross-class such as African-Americans
over the age of 50
• Inferences are theoretically appropriate
• Same idea for analytic models
• Results suggest that the “centered” and
“certainty” ad-hoc fixes for conditional
analyses are reasonable
Example: The NHAMCS Data
• Analysis Subclass: Visits to Emergency
Departments (ED) by African-American men ages
60 and above (this is another cross-class of the
U.S. population, which can theoretically appear in
all NHAMCS design strata and PSUs)
• Analysis Objective: Estimate the percentage of all
ED visits by members of this subclass for
dizziness and/or vertigo in 2004
• See West et al. (2008) for more details
Stata Code for NHAMCS Analyses
• svyset cpsum [pweight = patwt], strata(cstratm) singleunit(…)
• generate subc = (settype == 3 & sex == 2 & agecat == 5 & race == 2)
• svy: tabulate dizzyrfv if subc == 1, se ci percent   * conditional
• svy, subpop(subc): tabulate dizzyrfv, se ci percent   * unconditional
NHAMCS Analysis Results
Method          Est. %   TSL SE   Design DF
Missing SE      4.82     1.576    106
Centered        4.82     1.576    106
Certainty       4.82     1.576    106
Scaled          4.82     1.576    106
Unconditional   4.82     1.590    286
NHAMCS Analysis Implications
• No problems with strata having only a single ultimate
cluster: ad-hoc fixes all give the same results
• Weighted point estimates are identical
• Substantially fewer design-based degrees of freedom when
following the conditional approach; the full complex
design will not be reflected in estimation of sample-to-sample
variance (many ultimate clusters are lost)
• Conditional analysis assumes that each sample will be of
fixed size n = 397 for variance estimation purposes; the
subclass sample size is not treated as a random variable!
• Conditional analysis results in overly liberal inferences
Suggestions for Practice
• Consider Kish’s Taxonomy when determining an
appropriate subclass analysis approach
• Utilize the appropriate software options for
unconditional analyses when analyzing cross-classes
• Be careful with missing values when creating the
subpopulation indicator
• The unconditional analysis approach generally
works fine for both cases (when in doubt, use this
approach)
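The caution about missing values can be illustrated with a small sketch. The data and the variable names naive and safe are hypothetical. Stata treats a numeric missing value as larger than any number, so an expression such as age >= 50 is true when age is missing.

```stata
* Hypothetical data with missing age and missing race
clear
input race age
2 55
2 .
2 40
1 60
. 70
end

* Naive indicator: the case with missing age is flagged as in the subclass
gen byte naive = (race == 2 & age >= 50)

* Safer indicator: left missing whenever an input is missing
gen byte safe = (race == 2 & age >= 50) if !missing(race, age)

list
count if naive == 1
count if safe == 1
```

The naive indicator counts two subclass members and the safer one counts one. Whether cases with unknown status should be set to missing or to 0 is an analysis decision; the sketch only shows how the default comparison behaves.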
Directions for Future Research
• More appropriate calculation / estimation of
design-based and effective degrees of
freedom for sparse subclasses or mixed
classes
• Development of analytic theory for interval
estimation when working with small
subclasses, which does not rely on
asymptotic results
References
• Kish, L. 1987. Statistical Design for Research. New York:
Wiley.
• Rust, K. F., and J. N. K. Rao. 1996. Variance estimation for
complex surveys using replication. Statistical Methods in
Medical Research 5: 283–310.
• West, B. T., P. Berglund, and S. G. Heeringa. 2008. A
closer examination of subpopulation analysis of complex
sample survey data. The Stata Journal 8(3): 112.
• West, B. T., P. Berglund, and S. G. Heeringa. 2007.
Alternative approaches to subclass analysis of complex
sample survey data. Proceedings of the 2007 Joint
Statistical Meetings.
Questions / Thank You!
• For additional questions, comments, or
electronic copies of these slides or the
papers, please send an email to
bwest@umich.edu