The estimation strategy of the National Household Survey

The estimation strategy of the National Household Survey (NHS)
François Verret, Mike Bankier, Wesley Benjamin & Lisa Hayden
Statistics Canada
Presentation at the ITSEW 2011
June 21, 2011
Outline of the presentation
1. Introduction
2. Handling non-response error
3. Simulation set-up
4. Results
5. Limits of the study
6. Conclusion
7. Future work
1. Introduction
▪ 2006 Census: 20% long form, 80% short form
▪ 2011:
• 100% Census mandatory short form
• 30% sampled to voluntarily complete the NHS long form
▪ Objectives of the long form: get data to plan, deliver and support government programs directed at target populations
▪ 2011 topics common to both forms: demography, family structure, language
▪ Additional 2011 long form topics: education, ethnicity, income, immigration, mobility…
▪ NHS sample size is 4.5 million dwellings (f = 30%)
1. Introduction
▪ Non-response error in the NHS:
• Survey now voluntary => expect significant non-response
• To minimize the impact, after a fixed date restrict the collection efforts to a Non-Response Follow-Up (NRFU) random sub-sample
[Diagram: population U; 1st phase sample s, split into sr and snr; NRFU selected within snr, split into NRFUr and NRFUnr]
▪ Set-up developed by Hansen & Hurwitz (1946):
1. Select 1st phase sample s from population U
2. Non-response snr observed in s
3. NRFU selected from snr
4. Response NRFUr and non-response NRFUnr observed in the NRFU (HH assumed 100% resp. rate)
1. Introduction
[Diagram: population U; 1st phase sample s, split into sr and snr; NRFU selected within snr, split into NRFUr and NRFUnr]
▪ When 100% of the NRFU responds (as in Hansen and Hurwitz's original setting), the NRFU can be used to estimate the total in snr without non-response bias.
▪ This is not the case in the NHS.
▪ However, focusing the collection efforts on the NRFU converts part of the non-response bias (that would be observed in the full snr) into sub-sampling error.
2. Handling non-response error
▪ The estimation method chosen to minimize the remaining non-response bias should have the following properties:
• As few bias assumptions as possible should be made
• The method should be simple to explain and to implement in production
▪ Available micro-level auxiliary data to adjust for non-response:
• 2011 Census short form
• Tax data
▪ Calibration: Agreement with Census totals is desirable from a user’s perspective
2. Handling non-response error
▪ First class of contenders: Reweighting
• Usual method used to compensate for total non-response in social surveys
• The Hansen & Hurwitz estimator of a total
$$\hat{t}_{HH} = \sum_{s_r} \frac{y_k}{\pi_{ak}} + \sum_{NRFU} \frac{y_k}{\pi_{ak}\,\pi_{k|s_{nr}}}$$
is unbiased if 100% of the NRFU answers
▪ When the assumption does not hold, we must model the last non-response mechanism/phase and reweight accordingly…
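The Hansen & Hurwitz estimator above can be sketched in a few lines of Python. This is only an illustration: it assumes the two weights in the formula are the first-phase inclusion probability (called `pi_a` here) and the NRFU sub-sampling probability within snr (called `pi_sub` here), and the function name, data layout and toy numbers are hypothetical.

```python
def hansen_hurwitz_total(first_phase_resp, nrfu_units):
    """Hansen & Hurwitz style estimate of a total.

    first_phase_resp: list of (y, pi_a) for first-phase respondents (sr).
    nrfu_units: list of (y, pi_a, pi_sub) for units of the NRFU sub-sample,
                all assumed to respond (the 100% response assumption).
    """
    # First-phase respondents are weighted by 1 / pi_a.
    total = sum(y / pi_a for y, pi_a in first_phase_resp)
    # NRFU units also carry the sub-sampling weight 1 / pi_sub.
    total += sum(y / (pi_a * pi_sub) for y, pi_a, pi_sub in nrfu_units)
    return total


# Toy data (hypothetical): two first-phase respondents, one NRFU unit.
sr = [(10.0, 0.3), (20.0, 0.3)]
nrfu = [(30.0, 0.3, 0.5)]
t_hh = hansen_hurwitz_total(sr, nrfu)
print(round(t_hh, 6))
```

With these toy values the estimate is 10/0.3 + 20/0.3 + 30/0.15, which is 300 up to floating-point rounding.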
2. Handling non-response error
▪ Scores method:
• Model the probability of response with a logistic regression
• Form Response Homogeneity Groups (RHG) of respondents and non-respondents with similar predicted response probabilities
• Calculate the response rate in each RHG and assign these new predicted response probabilities to respondents
• Divide the NRFUr weights by this probability:
$$\hat{t}_{scores} = \sum_{s_r} \frac{y_k}{\pi_{ak}} + \sum_{NRFU_r} \frac{y_k}{\pi_{ak}\,\pi_{k|s_{nr}}\,\hat{p}_k^{RHG}}$$
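The steps of the scores method can be sketched as follows. The sketch assumes the logistic regression has already been fitted, so each NRFU unit arrives with a predicted response probability `p_hat`. Forming RHGs as equal-sized groups of sorted probabilities and using unweighted response rates are assumptions made for illustration; the slides do not say how the groups were cut. All names and numbers are hypothetical.

```python
def rhg_response_rates(pred_probs, responded, n_groups):
    """Return, for each unit, the observed response rate of its RHG.

    Units are sorted by predicted probability and cut into n_groups
    groups of (nearly) equal size; this grouping rule is an assumption.
    """
    order = sorted(range(len(pred_probs)), key=lambda i: pred_probs[i])
    rates = [0.0] * len(pred_probs)
    size = len(order) / n_groups
    for g in range(n_groups):
        members = order[round(g * size): round((g + 1) * size)]
        if not members:
            continue
        rate = sum(responded[i] for i in members) / len(members)
        for i in members:
            rates[i] = rate
    return rates


def scores_total(sr, nrfu, n_groups):
    """sr: list of (y, pi_a); nrfu: list of (y, pi_a, pi_sub, p_hat, responded)."""
    rates = rhg_response_rates([u[3] for u in nrfu], [u[4] for u in nrfu], n_groups)
    total = sum(y / pi_a for y, pi_a in sr)
    for (y, pi_a, pi_sub, _p_hat, responded), rate in zip(nrfu, rates):
        if responded:
            # The NRFU respondent weight is divided by the RHG response rate.
            total += y / (pi_a * pi_sub * rate)
    return total


# Toy data (hypothetical): one first-phase respondent, four NRFU units.
sr = [(100.0, 0.5)]
nrfu = [
    (10.0, 0.5, 0.5, 0.2, 0),
    (20.0, 0.5, 0.5, 0.3, 1),
    (30.0, 0.5, 0.5, 0.8, 1),
    (40.0, 0.5, 0.5, 0.9, 1),
]
t_scores = scores_total(sr, nrfu, n_groups=2)
```

In this toy example the low-probability RHG has a response rate of 0.5 and the high-probability RHG a rate of 1.0, so the single respondent in the first group counts double.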
2. Handling non-response error
▪ Second class of contenders: Imputation
• Usual method to compensate for item non-response
• We will consider nearest-neighbour imputation using the CANadian Census Edit & Imputation System (CANCEIS) only
1. Partial imputation: Impute only non-respondents to the sub-sample (NRFUnr) and use reweighting to take sampling into account
$$\hat{t}_{partial} = \sum_{s_r} \frac{y_k}{\pi_{ak}} + \sum_{NRFU_r} \frac{y_k}{\pi_{ak}\,\pi_{k|s_{nr}}} + \sum_{NRFU_{nr}} \frac{\hat{y}_k}{\pi_{ak}\,\pi_{k|s_{nr}}}$$
2. Mass imputation: Impute all non-respondents (snr \ NRFUr)
$$\hat{t}_{mass} = \sum_{s_r \cup NRFU_r} \frac{y_k}{\pi_{ak}} + \sum_{s_{nr} \cap NRFU_r^c} \frac{\hat{y}_k}{\pi_{ak}}$$
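The difference between the two imputation estimators can be illustrated with a small sketch. It does not reproduce CANCEIS: a plain nearest-neighbour rule on squared Euclidean distance stands in for it, and the donor pool is passed in explicitly because the slides do not specify it here. The probabilities `pi_a` (first phase) and `pi_sub` (NRFU sub-sampling), the function names and the toy data are all assumptions.

```python
def nearest_donor_value(recipient_x, donors):
    """Return y from the donor whose auxiliary vector x is closest.

    donors: list of (x, y) with x a tuple of auxiliary values.
    Stand-in for CANCEIS, using squared Euclidean distance.
    """
    best = min(donors, key=lambda d: sum((a - b) ** 2 for a, b in zip(recipient_x, d[0])))
    return best[1]


def partial_imputation_total(sr, nrfu_r, nrfu_nr, donors):
    """sr: (x, y, pi_a); nrfu_r: (x, y, pi_a, pi_sub); nrfu_nr: (x, pi_a, pi_sub)."""
    total = sum(y / pi_a for _x, y, pi_a in sr)
    total += sum(y / (pi_a * pi_sub) for _x, y, pi_a, pi_sub in nrfu_r)
    # Only NRFU non-respondents are imputed; sub-sampling weights are kept.
    total += sum(nearest_donor_value(x, donors) / (pi_a * pi_sub)
                 for x, pi_a, pi_sub in nrfu_nr)
    return total


def mass_imputation_total(respondents, nonrespondents, donors):
    """respondents: sr and NRFUr as (x, y, pi_a); nonrespondents: rest of snr as (x, pi_a)."""
    total = sum(y / pi_a for _x, y, pi_a in respondents)
    # Every remaining non-respondent is imputed and keeps its first-phase weight.
    total += sum(nearest_donor_value(x, donors) / pi_a for x, pi_a in nonrespondents)
    return total


# Toy data (hypothetical), with a single auxiliary variable.
donors = [((1.0,), 10.0), ((5.0,), 50.0)]
t_partial = partial_imputation_total(
    sr=[((1.0,), 10.0, 0.5)],
    nrfu_r=[((5.0,), 50.0, 0.5, 0.5)],
    nrfu_nr=[((4.0,), 0.5, 0.5)],
    donors=donors,
)
t_mass = mass_imputation_total(
    respondents=[((1.0,), 10.0, 0.5), ((5.0,), 50.0, 0.5)],
    nonrespondents=[((4.0,), 0.5), ((2.0,), 0.5)],
    donors=donors,
)
```

Partial imputation keeps the sub-sampling weight on every NRFU unit, whereas mass imputation drops it and fills in every unit of snr that did not respond.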
2. Handling non-response error
▪ Some pros & cons
Methods (table columns): Scores, Partial imputation, Mass imputation
Criteria and marks:
• Preserves micro-level information of non-respondents: √ √√
• Does not create synthetic information: √√ √
• Uses less heavy non-response hypotheses: √√ √√
• Fully takes sub-sampling design into account: √√ √√
• Census systems available: √√ √√
• More calibration to known Census totals can be done: √ √√
3. Simulation set-up
▪ Use 2006 Census 20% long form sample data
▪ Restricted to Census Metropolitan Area (CMA) of Toronto
▪ Simulation aimed at preserving the properties of the NHS (except for the f = 30%):
• Non-response to the 1st phase was simulated by deterministically blanking out the data of the 63% of respondents who answered last in 2006
• Of these non-respondents, the 78% who answered first will have their response restored if they are selected in the NRFU sub-sample
• NRFU sub-sampling was simulated by selecting a stratified random sample of 41% of snr
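The three simulation steps above can be sketched as follows. This is a simplified illustration under stated assumptions: units are identified by their position in the order of questionnaire return, and the NRFU is drawn by simple random sampling rather than the stratified design used in the study (the strata are not described in the slides). The function name and defaults are hypothetical apart from the 63%, 78% and 41% rates.

```python
import random


def simulate_phases(return_order, late_share=0.63, early_share=0.78,
                    nrfu_rate=0.41, seed=0):
    """return_order: unit ids sorted from first to last 2006 response."""
    n = len(return_order)
    n_nr = round(late_share * n)
    # The last 63% to answer are treated as first-phase non-respondents.
    s_r = return_order[: n - n_nr]
    s_nr = return_order[n - n_nr:]
    # Among them, the earliest 78% would respond if followed up.
    would_respond = set(s_nr[: round(early_share * len(s_nr))])
    # Simple random NRFU sub-sample of 41% of snr (the study stratified it).
    rng = random.Random(seed)
    nrfu = rng.sample(s_nr, round(nrfu_rate * len(s_nr)))
    nrfu_r = [k for k in nrfu if k in would_respond]
    nrfu_nr = [k for k in nrfu if k not in would_respond]
    return s_r, s_nr, nrfu_r, nrfu_nr


# Hypothetical example with 1,000 units numbered in order of response.
s_r, s_nr, nrfu_r, nrfu_nr = simulate_phases(list(range(1000)))
print(len(s_r), len(s_nr), len(nrfu_r) + len(nrfu_nr))
```

Because the two non-response mechanisms are deterministic, only the NRFU draw depends on the seed.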
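3. Simulation set-up
▪ Estimators calculated
• As points of reference, unbiased estimators:
$$\hat{t}_{2006} = \sum_{s} \frac{y_k}{\pi_{ak}}$$
$$\hat{t}_{HH} = \sum_{s_r} \frac{y_k}{\pi_{ak}} + \sum_{NRFU} \frac{y_k}{\pi_{ak}\,\pi_{k|s_{nr}}}$$
• As contenders:
$$\hat{t}_{scores} = \sum_{s_r} \frac{y_k}{\pi_{ak}} + \sum_{NRFU_r} \frac{y_k}{\pi_{ak}\,\pi_{k|s_{nr}}\,\hat{p}_k^{RHG}}$$
$$\hat{t}_{partial} = \sum_{s_r} \frac{y_k}{\pi_{ak}} + \sum_{NRFU_r} \frac{y_k}{\pi_{ak}\,\pi_{k|s_{nr}}} + \sum_{NRFU_{nr}} \frac{\hat{y}_k}{\pi_{ak}\,\pi_{k|s_{nr}}}$$
$$\hat{t}_{mass} = \sum_{s_r \cup NRFU_r} \frac{y_k}{\pi_{ak}} + \sum_{s_{nr} \cap NRFU_r^c} \frac{\hat{y}_k}{\pi_{ak}}$$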
3. Simulation set-up
▪ The scores method
• A single logistic regression was done for the whole CMA of Toronto
• Household response probability was predicted
• Considered for stepwise selection: household-level variables, our best attempt at summarizing the person-level information and one paradata variable
• R-square of 26%
• 13 RHG formed with predicted probabilities ranging from 29% to 95%
3. Simulation set-up
▪ Imputation methods
• Nearest-neighbour imputation done with CANCEIS
• RHG is defined by household size
• The distance between non-respondents and donors (respondents) is defined by weighting each household-level, person-level and paradata characteristic in the distance function
• Preference is given to donors who are geographically close
• For each non-respondent, a list of donors is made and one is randomly selected with probability proportional to a measure of size (1st phase weight for mass imputation, scores method weights for partial imputation)
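The last step, drawing one donor from the list with probability proportional to a measure of size, can be illustrated with the standard library. The donor labels and sizes below are hypothetical, and this says nothing about how CANCEIS builds the donor list itself.

```python
import random


def pick_donor(donor_ids, sizes, rng):
    """Pick one donor with probability proportional to its measure of size."""
    return rng.choices(donor_ids, weights=sizes, k=1)[0]


rng = random.Random(2011)
# Hypothetical donor list for one non-respondent; sizes play the role of
# the weights used as measure of size.
donors = ["A", "B", "C"]
sizes = [5.0, 3.3, 1.7]
picks = [pick_donor(donors, sizes, rng) for _ in range(10_000)]
share_a = picks.count("A") / len(picks)  # expected to be near 5.0 / 10.0
```

Over many draws, donor "A" is selected about half the time, in line with its share of the total size.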
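3. Simulation set-up
▪ M = 84 non short form characteristics over the various topics
▪ Average relative difference:
• Calculated at the CMA level:
$$\frac{100}{M}\sum_{j=1}^{M}\frac{\left|\hat{t}_{j}-\hat{t}_{2006,j}\right|}{\hat{t}_{2006,j}} \qquad \frac{100}{M}\sum_{j=1}^{M}\frac{\left|\hat{t}_{j}-\hat{t}_{HH,j}\right|}{\hat{t}_{HH,j}}$$
• At the Weighting area (953 WA in total) level within the CMA:
$$\frac{100}{953M}\sum_{i=1}^{953}\sum_{j=1}^{M}\frac{\left|\hat{t}_{ij}-\hat{t}_{2006,ij}\right|}{\hat{t}_{2006,ij}} \qquad \frac{100}{953M}\sum_{i=1}^{953}\sum_{j=1}^{M}\frac{\left|\hat{t}_{ij}-\hat{t}_{HH,ij}\right|}{\hat{t}_{HH,ij}}$$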
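As a small worked example of the average relative difference, the function below computes it for one estimator against one point of comparison. It assumes the differences are taken in absolute value, which is an interpretation of the slide. The function name and numbers are hypothetical.

```python
def average_relative_difference(estimates, reference):
    """Average relative difference, in percent.

    estimates, reference: equal-length lists of estimated totals, one entry
    per characteristic (CMA level) or per (area, characteristic) pair (WA level).
    Absolute differences are an assumption of this sketch.
    """
    if len(estimates) != len(reference):
        raise ValueError("estimates and reference must have the same length")
    ratios = [abs(t - r) / r for t, r in zip(estimates, reference)]
    return 100 * sum(ratios) / len(ratios)


# Two hypothetical characteristics: relative differences of 2% and 5%.
ard = average_relative_difference([102.0, 95.0], [100.0, 100.0])
print(round(ard, 2))
```

For the WA-level version, the same function can be applied to the totals of all 953 × M area-by-characteristic cells flattened into one list.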
4. Results
▪ Errors at the CMA and WA levels for Toronto (average relative difference, %)

                              CMA, point of comparison            WA, point of comparison
Estimator                     Full first-phase  Hansen & Hurwitz  Full first-phase  Hansen & Hurwitz
Hansen & Hurwitz estimator    0.94              0.00              22.98             0.00
Mass imputation               2.97              N/A               24.56             N/A
Partial imputation            2.25              1.52              26.69             13.22
Scores method                 2.03              1.45              26.77             18.67
5. Limits of the study
▪ Results:
• The simulation only includes one replication of the sub-sampling and non-response mechanisms
• Non-response bias is the measure of interest, but errors were presented
• Non-response mechanisms were generated deterministically. Should they be generated probabilistically?
• The 2011 sampling, non-response and available data (e.g., paradata) cannot be replicated exactly
• Only totals were studied. What about other parameters such as correlations?
5. Limits of the study
▪ Possible confounding effects:
• Logistic regression was done at the aggregated level of the CMA and no WA effect or interaction was considered
• Paradata for imputation is more closely related to the non-response mechanism (give preference to late respondents in the distance)
• Weighting of donors in imputation has an impact
• Calibration done from sample to U; calibration at inner levels/phases could help scores and partial imputation
6. Conclusion
▪ With these preliminary results, the scores method seems to be doing well at aggregate levels, while partial imputation is doing better than scores at finer levels
• Mass imputation: Can you override the known sub-sample design with an imputation model?
• Partial imputation: Can include more information (person-level, paradata) than scores, but weighting of each component in the distance is partially data driven and not straightforward
• Scores method: More difficult to include the information, but variable selection to explain non-response is direct
7. Future Work
▪ Possible:
• Replicate sub-sampling and imputation more than once to isolate bias components
• Consider other levels of calibration in the comparisons
• Hybrid of scores and partial imputation
▪ Definite:
• Implement a method into NHS production
• Estimate the errors and variances (multi-phase, large sampling fractions, errors due to modeling,…) and educate data users
▪ Important to get a good model for the last non-response mechanism. Whatever the method, quality of the results is a function of the auxiliary data available.
For more information,
please contact:
François Verret - SSMD/DMES
Francois.Verret@statcan.gc.ca
(613) 951-7318