Nonresponse in Survey Research

Nonresponse in survey
research: why is it a
problem?
Robert Voogt
Dutch Ministry of Social Affairs and Employment
(formerly of the University of Amsterdam)
Overview
• What is nonresponse, why is it a problem and why does the
traditional way of correcting for nonresponse not solve the
problem
• Overview of general correction techniques
• An alternative approach to correct for nonresponse bias
• Real life illustration
What is nonresponse, why is it a
problem and why does the traditional
way of correcting for nonresponse
not solve the problem?
Survey research
• Population is sampled
• Sample is a good representation of population when good
sample techniques are used
• Not all sample elements will respond
Unit vs Item nonresponse
• Some sample elements are not reached; others refuse or do not send
back the questionnaire: unit nonresponse
• Some who do answer the questionnaire do so incompletely:
item nonresponse
MCAR, MAR, MNAR
Three general nonresponse mechanisms can be distinguished
• MCAR: Missing Completely At Random
• MAR: Missing At Random
• MNAR: Missing Not At Random
Missing Completely At Random (MCAR)
• Consider the conditional distribution of the missingness indicator M
given the survey outcomes Y and the survey design variables Z.
Let f(M|Y,Z,q) denote this distribution, with q the unknown parameters.
• If MCAR: f(M|Y,Z,q) = f(M|q) for all Y,Z,q
• Not a realistic assumption
Example MCAR
• Taking a random subsample of a group of nonrespondents
• If a random subsample of the nonrespondents is analysed (after
obtaining answers from all of them), the nonsampled
nonrespondents can be said to be MCAR
• So correction methods using the MCAR assumption can be
used
Missing At Random (MAR)
• MAR: f(M|Y,Z,q) = f(M|Yobs,Z,q) for all Ymis,q
• where Yobs denotes all the observed survey data
• This means that missingness depends on the observed
variables, the observed values of incomplete variables or on
the design variables, but not on the variables or values that
are missing
Example MAR
• For both respondents and nonrespondents we know their
level of education
• Respondents and nonrespondents who share the same level of
education have the same distribution on the unobserved
variables
• Most survey nonresponse adjustment methods assume
MAR
Missing Not At Random (MNAR, also written NMAR)
• MNAR: f(M|Y,Z,q) = f(M|Yobs,Ymis,Z,q) for all Yobs,Ymis,q
• This means that missingness depends on the missing values,
even after conditioning on the observed data
• To obtain unbiased estimates, a joint model of the data
and the nonresponse mechanism is necessary
Example MNAR
• For both respondents and nonrespondents we know their
level of education
• Given the level of education, nonresponse on the variables
of interest is not random
• This means it is not sufficient to use only level of education
to correct for nonresponse bias.
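The difference between the three mechanisms can be illustrated with a small simulation. The sketch below is hypothetical: the population, the function name `simulate` and all probabilities are assumptions chosen for illustration, not figures from the study. It compares the raw respondent mean with an estimate adjusted for level of education.

```python
import random

def simulate(mechanism, n=20000, seed=1):
    """Return (population mean, respondent mean, education-adjusted mean) of a 0/1 variable Y."""
    rng = random.Random(seed)
    pop = []    # (educ_high, y) for every sample element
    resp = []   # the same pairs, respondents only
    for _ in range(n):
        educ_high = rng.random() < 0.5                 # known for everyone
        y = 1 if rng.random() < (0.6 if educ_high else 0.2) else 0
        if mechanism == "MCAR":
            p_resp = 0.5                               # identical for all elements
        elif mechanism == "MAR":
            p_resp = 0.7 if educ_high else 0.3         # depends on observed education only
        else:                                          # "MNAR"
            p_resp = 0.7 if y == 1 else 0.3            # depends on the target variable itself
        pop.append((educ_high, y))
        if rng.random() < p_resp:
            resp.append((educ_high, y))

    def mean(rows):
        return sum(y for _, y in rows) / len(rows)

    # Respondent mean per education level, weighted by the share of that
    # level among all sample elements.
    adjusted = 0.0
    for level in (True, False):
        share = sum(1 for e, _ in pop if e == level) / n
        adjusted += share * mean([row for row in resp if row[0] == level])
    return mean(pop), mean(resp), adjusted

for mechanism in ("MCAR", "MAR", "MNAR"):
    true_mean, raw, adjusted = simulate(mechanism)
    print(mechanism, round(true_mean, 3), round(raw, 3), round(adjusted, 3))
```

Under these assumptions the raw respondent mean is close to the truth only for MCAR, the education adjustment repairs MAR, and neither estimate recovers the truth under MNAR.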
Nonresponse bias
If nonresponse is not a result of the design, the data are almost always
NMAR, and the estimates are biased by nonresponse as a result.
The amount of nonresponse bias is dependent on:
1. the correlation between the target variable(s) and the
nonresponse mechanism;
2. the level of nonresponse.
Nonresponse bias
$$C(r,Y) = \frac{1}{N} \sum_{k=1}^{N} (r_k - \bar{r})(Y_k - \bar{Y})$$
with
Yk: the score of element k in the population on the target variable
rk: probability of response of element k in the population when contacted in the
sample
C(r,Y): population covariance between response probabilities and the values of
the target variable
Nonresponse bias
$$C(r,Y) = \frac{1}{N} \sum_{k=1}^{N} (r_k - \bar{r})(Y_k - \bar{Y})$$
with
• $(Y_k - \bar{Y})$: the difference between the score of element k on the
variable of interest and the population mean
• $(r_k - \bar{r})$: the difference between the probability to respond of
element k and the mean probability to respond
• It follows from this equation that the response level in itself does not say everything:
the amount of bias depends on the relation between the first and second part of the
equation
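The covariance can be computed directly for a toy population. The six elements below are invented for illustration. The sketch also relies on a standard approximation that is not on the slide: the bias of the respondent mean is roughly C(r,Y) divided by the mean response probability.

```python
def response_covariance(r, y):
    """Population covariance C(r, Y) between response probabilities and target values."""
    n = len(r)
    r_bar = sum(r) / n
    y_bar = sum(y) / n
    return sum((rk - r_bar) * (yk - y_bar) for rk, yk in zip(r, y)) / n

# Hypothetical population (assumed values, not from the study).
y = [1, 1, 1, 0, 0, 0]                  # target variable Y_k
r = [0.8, 0.8, 0.8, 0.4, 0.4, 0.4]      # response probabilities r_k

c = response_covariance(r, y)
r_bar = sum(r) / len(r)
y_bar = sum(y) / len(y)

# Approximate bias of the respondent mean (assumed formula: C(r,Y) / mean r).
approx_bias = c / r_bar

# Expected respondent mean, for comparison: sum(r_k * Y_k) / sum(r_k).
expected_resp_mean = sum(rk * yk for rk, yk in zip(r, y)) / sum(r)

print(c, approx_bias, expected_resp_mean - y_bar)
```

If every element had the same response probability, the covariance and hence the bias would be zero whatever the response level, which is the point of the last bullet.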
Traditional correction methods
• Use population information to compare the respondent group with
the population
• Use information that is available for both respondents and
nonrespondents
• Use information about the difficulty of obtaining data from the
respondents
• In fact, the assumption is that the data are MAR, given the values of
the variables of which population information or information about
the nonrespondents is available
Traditional correction methods
• No information about the difference on the variables of
interest between the respondents and nonrespondents
• No information about the difference in response
probabilities between sample elements that score different
on the variables of interest
• So there is no reason why this way of correcting should
work
Overview of general correction
techniques
Different correction techniques
• Weighting: assigning each observed element an adjustment
weight
• Extrapolation: respondents who are most like the
nonrespondents are used for correction
• Imputation: missing values are substituted by estimates
Weighting
• Weighting: assigning each observed element an
adjustment weight
• Sample elements that belong to groups that seem
underrepresented on the variables used in the weighting
will have a high adjustment weight
• Sample elements that belong to groups that seem
overrepresented among the respondents will have a low
adjustment weight
Weighting Example
• Question: Have you ever visited Lugano? (Y/N)
• Population information available about age (18-30) (31-64) (65 and older)
• Comparison of respondents and population
• Weighting

Age    Resp  Popul  Weight
18-30  20%   30%    30/20=1.5
31-64  70%   50%    50/70=0.7
65+    10%   20%    20/10=2.0

Lug  18-30     31-64     65+      Unw       W*
Yes  20% (4)   50% (35)  10% (1)  40% (40)  33% (33)
No   80% (16)  50% (35)  90% (9)  60% (60)  67% (67)
N    20        70        10       100       100

Yes: 4*1.5 + 35*.7 + 1*2.0 = 6+24.5+2 = 32.5
No: 16*1.5 + 35*.7 + 9*2.0 = 24+24.5+18 = 66.5
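The weighting arithmetic above can be reproduced in a few lines. This is a minimal sketch using the counts from the example; the variable names are illustrative. It keeps the weights unrounded, which is why the weighted percentage comes out at about 33% rather than 32.5.

```python
# Respondent counts per age class: (visited Lugano: yes, no), from the example.
respondents = {
    "18-30": (4, 16),
    "31-64": (35, 35),
    "65+":   (1, 9),
}
# Known population distribution of age.
population_share = {"18-30": 0.30, "31-64": 0.50, "65+": 0.20}

n_resp = sum(yes + no for yes, no in respondents.values())

# Adjustment weight = population share / respondent share of the age class.
weights = {}
for age, (yes, no) in respondents.items():
    respondent_share = (yes + no) / n_resp
    weights[age] = population_share[age] / respondent_share

weighted_yes = sum(weights[age] * yes for age, (yes, _) in respondents.items())
weighted_no = sum(weights[age] * no for age, (_, no) in respondents.items())
pct_yes = 100 * weighted_yes / (weighted_yes + weighted_no)

print(weights)
print(round(pct_yes, 1))
```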
Extrapolation
• Central idea: some groups of respondents are more like the
nonrespondents than others are
• For example, sample elements that first refused, but when
contacted for the second time, were persuaded to
participate, can be used as proxies for the final refusals
Extrapolation Example
• Question: Have you ever visited Lugano? (Y/N)
• Two respondent groups: early respondents and late respondents
• Calculate the distribution among the nonrespondents using the last respondent method

Lug  R1        R2        TR        NR        TS
Yes  48% (29)  28% (11)  40% (40)  20% (10)  33% (50)
No   52% (31)  72% (29)  60% (60)  80% (40)  67% (100)
N    60        40        100       50        150
R1: early respondents, R2: late respondents, TR: total respondents, NR: nonrespondents, TS: total sample

Last respondent: L = A2 + (A2 - A1) * ((X2 - X1)/X2), with:
L: theoretical last respondent
A: % response to an item in a wave
X: cumulative % respondents at the end of a wave
L = 28 + (28 - 48) * ((67 - 40)/67) = 28 - 20*.40 = 20%
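The last respondent formula can be sketched as a small function. Reading "(X2-X1/X2)" as (X2 - X1)/X2 is an assumption, supported by the factor .40 in the worked line; the function and argument names are illustrative. The inputs are the percentages from the table (48% and 28% "yes" in the two waves, 60 and 100 of the 150 sample elements responding cumulatively).

```python
def last_respondent(a1, a2, x1, x2):
    """Extrapolate the wave trend to the theoretical last respondent.

    a1, a2: % giving the answer in wave 1 and in wave 2
    x1, x2: cumulative % of the sample that has responded after each wave
    """
    return a2 + (a2 - a1) * ((x2 - x1) / x2)

x1 = 100 * 60 / 150     # 40% of the sample responded in the first wave
x2 = 100 * 100 / 150    # about 67% had responded after the second wave

estimate_nr = last_respondent(a1=48, a2=28, x1=x1, x2=x2)
print(round(estimate_nr, 1))
```

With these inputs the extrapolated value agrees with the 20% shown in the NR column of the table.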
Imputation
• Imputation: missing values are substituted by estimates
Different methods of imputation:
• Single Imputation: for each missing value one value is imputed
• Hot Deck Imputation: a missing value is replaced by an observed
value of a comparable respondent
• Multiple Imputation: for each missing value several values are imputed; in
this way the uncertainty that imputation brings with it is also taken
into account
Hot Deck Imputation Example
• Divide the respondents into homogeneous groups, for example by
using CHAID.
• CHAID recursively partitions a sample into groups so that the
variance of the dependent variable is minimized within groups and
maximized among groups
• Link each nonrespondent to the group it fits in best
• Substitute the value of a random respondent from the same group for
the missing value of the nonrespondent
Hot Deck Imputation Example, part 2
CHAID finds groups: age 18-30, 31-64/low education, 31-64/high education,
65+/male and 65+/female

Grp         R         HDI NR      TS
18-30       20% (4)   25*.20 =5   9
31-64/low   33% (10)  4* .33 =1   11
31-64/high  63% (25)  1* .63 =1   26
65+/male    20% (1)   8* .20 =2   3
65+/female  0% (0)    12* .0 =0   0
% Lug Yes   40% (40)  18% (9)     33% (49)
N           100       50          150
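A random hot deck over these groups can be sketched as follows. The respondent group sizes (20, 30, 40, 5, 5) are derived from the counts and percentages in the table and are therefore approximate; the function name and the fixed seed are illustrative choices. Unlike the table, which multiplies group size by the group's "yes" rate, this version draws an actual donor for every nonrespondent, so the imputed total varies with the seed.

```python
import random

# Donor pool per CHAID group: respondent answers (1 = visited Lugano).
donors = {
    "18-30":      [1] * 4 + [0] * 16,
    "31-64/low":  [1] * 10 + [0] * 20,
    "31-64/high": [1] * 25 + [0] * 15,
    "65+/male":   [1] * 1 + [0] * 4,
    "65+/female": [0] * 5,
}
# Number of nonrespondents linked to each group (from the table).
nonrespondents = {"18-30": 25, "31-64/low": 4, "31-64/high": 1,
                  "65+/male": 8, "65+/female": 12}

def hot_deck(donors, nonrespondents, seed=0):
    """Impute each nonrespondent with the answer of a random donor from the same group."""
    rng = random.Random(seed)
    return {
        group: [rng.choice(donors[group]) for _ in range(n_missing)]
        for group, n_missing in nonrespondents.items()
    }

imputed = hot_deck(donors, nonrespondents)
imputed_yes = sum(sum(values) for values in imputed.values())
n_imputed = sum(len(values) for values in imputed.values())
print(imputed_yes, "of", n_imputed, "nonrespondents imputed as 'yes'")
```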
Multiple Imputation Example
• For each case, 5 values for each missing variable are calculated,
using a regression equation and adding a random error term
• These values are combined into one single value, for example, by taking
the mean
• The variance will take the uncertainty due to the imputed value into
account by combining the within imputation variance (the variance of
each estimated data set) and the between imputation variance (in
which all 5 data sets are used)
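The slide does not give the formula for combining the two variance components. A common choice, assumed here, is Rubin's rules: total variance = mean within-imputation variance + (1 + 1/m) times the between-imputation variance. The five estimates and variances below are hypothetical numbers for illustration only.

```python
from statistics import mean, variance

def pool(estimates, within_variances):
    """Pool m completed-data estimates with Rubin's rules (assumed combination rule)."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(within_variances)      # average variance inside each data set
    between = variance(estimates)        # sample variance across the m estimates
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, total_variance

# Hypothetical proportions from 5 imputed data sets and their variances.
estimates = [0.33, 0.35, 0.31, 0.34, 0.32]
within_variances = [0.0015] * 5

pooled_estimate, total_variance = pool(estimates, within_variances)
print(pooled_estimate, total_variance)
```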
Multiple Imputation Example, part 2
Percentage that has visited Lugano

       Imp1  Imp2  Imp3  Imp4  Imp5  Mean
NR 1   .41   .56   .34   .62   .44   .47
NR 2   .67   .77   .81   .56   .64   .69
NR 3   .28   .11   .07   .15   .22   .17
NR 4   .02   .10   .06   .23   .09   .10
….
NR 50  .21   .32   .46   .16   .20   .27
TNR                                  .33
An alternative approach to correct for
nonresponse
Key to success of correction methods
• The information used in the correction method
• The correction method must model the nonresponse
mechanism
• The variables used in correction should have a relation with:
– the variables of interest
– the probability to respond of a sample element
Central Question Method
(Bethlehem & Kersten, 1984)
• Nonrespondents are asked to answer one (or more)
questions central to the subject of the study
• The central questions are believed to have a strong relation
with both the nonresponse process and the subject of the
study
• Central questions are used in correction
Central Question Example
• Central Question: Have you ever visited Switzerland? (Y/N)
• Question of interest: Have you ever visited Lugano? (Y/N)
• Comparison of respondents and nonrespondents
• Weighting as correction technique

CQ   Resp  Nonr  TS    Weight
Yes  60%   10%   43%   43/60=0.72
No   40%   90%   57%   57/40=1.43
N    100   50    150

Lug  CQ:Y      CQ:N       Unw       W*
Yes  67% (40)  0% (0)     40% (40)  29% (29)
No   33% (20)  100% (40)  60% (60)  71% (71)
N    60        40         100       100

Yes: 40*.72 + 0*1.43 = 28.8 + 0 = 29
No: 20*.72 + 40*1.43 = 14.4 + 57.2 = 71.6 (71 with unrounded weights)
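The difference from ordinary weighting is where the target distribution comes from: it is built from the central question answers of respondents and nonrespondents together. A minimal sketch with the counts from the example (5 and 45 are the 10% and 90% of the 50 nonrespondents; the variable names are illustrative):

```python
# Central question (visited Switzerland?) answers, from the example.
resp_cq = {"yes": 60, "no": 40}          # respondents
nonresp_cq = {"yes": 5, "no": 45}        # nonrespondents who answered the central question
lugano_yes = {"yes": 40, "no": 0}        # respondents who visited Lugano, by central question answer

n_resp = sum(resp_cq.values())
n_total = n_resp + sum(nonresp_cq.values())

# Weight = share in the total sample / share among the respondents.
weights = {
    answer: ((resp_cq[answer] + nonresp_cq[answer]) / n_total) / (resp_cq[answer] / n_resp)
    for answer in resp_cq
}

weighted_yes = sum(weights[a] * lugano_yes[a] for a in resp_cq)
weighted_n = sum(weights[a] * resp_cq[a] for a in resp_cq)
pct_lugano = 100 * weighted_yes / weighted_n

print({a: round(w, 2) for a, w in weights.items()})
print(round(pct_lugano))
```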
Real Life Illustration
Illustration
• Election study
• High levels of nonresponse
• External information available to test the success of the
correction procedures
Our research questions
• Does nonresponse cause a problem in election studies?
• Is using background variables sufficient or do we need
central questions?
• Do different correction techniques lead to different results?
• Is it really necessary to recontact nonrespondents?
Data Collection
• City of Zaanstad, The Netherlands
• N=995; 901 used
• Recontacting refusals
• Mixed mode data collection
• Two central questions:
– Voted in 1998 national elections
– Political interest
Response rate
Method        Result                  N    %
Telephone     Complete questionnaire  452  50.2
Telephone     Central questions       81   9.0
Mail          Complete questionnaire  94   10.4
Mail          Central questions       27   3.0
Face-to-face  Complete questionnaire  158  17.5
Face-to-face  Central questions       37   4.1
Nonresponse                           52   5.8
Total sample                          901  100
Does nonresponse cause problems?
We distinguish four groups:
• Response at first contact (470)
• Response after two contacts (76)
• Response after three or four contacts (158)
• Nonrespondents (including those who answered the central
questions) (197)
Comparison of response groups
(%)                     R1  R2  R3  NR
Voted nat. elections    86  70  60  62
Voted prov. elections   47  46  25  29
Interested in politics  79  76  55  27
Voting not important    9   17  38  -
Conclusion: nonresponse bias is present
How to correct?
Use the Central Question Procedure and compare it with
more traditional correction methods
Two central questions:
• Voted at national elections (0-1) – from election lists (so no
response bias)
• Political interest (0-1) – from short nonresponse
questionnaire
Correction methods
• Weighting by background variables / + central questions
• Extrapolation
• Hot Deck Imputation by background variables / + central
questions
• Multiple Imputation by background variables / + central
questions
for response levels of 52 % and 78 %
Weighting
• On background variables: age, ethnicity, gender, household
composition, education, residential value, number of years
living in current residence, social cohesion in neighborhood;
using an iterative procedure
• As above plus validated voter turnout national elections
1998 and political interest (central questions)
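The slide does not name the iterative procedure. One common choice for weighting on several variables at once is raking (iterative proportional fitting), and the sketch below assumes that technique. The two-variable table and the population totals are hypothetical and much smaller than the set of background variables listed above.

```python
def rake(cells, row_targets, col_targets, iterations=50):
    """Iterative proportional fitting of a two-way table of respondent counts.

    cells: {(row, col): respondent count}
    row_targets, col_targets: {category: population total}
    Returns fitted cell totals whose margins match both sets of targets.
    """
    fitted = dict(cells)
    for _ in range(iterations):
        for axis, targets in ((0, row_targets), (1, col_targets)):
            for category, target in targets.items():
                current = sum(v for k, v in fitted.items() if k[axis] == category)
                factor = target / current
                for k in fitted:
                    if k[axis] == category:
                        fitted[k] *= factor
    return fitted

# Hypothetical respondent counts by age group and gender.
cells = {("young", "m"): 10, ("young", "f"): 10,
         ("old", "m"): 50, ("old", "f"): 30}
age_targets = {"young": 30, "old": 70}      # assumed population totals
gender_targets = {"m": 50, "f": 50}

fitted = rake(cells, age_targets, gender_targets)
weights = {k: fitted[k] / cells[k] for k in cells}   # adjustment weight per cell
print({k: round(w, 3) for k, w in weights.items()})
```

Each pass rescales the table to one margin at a time, so only the marginal population distributions are needed, not the full cross-classification.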
Extrapolation
• Last Respondent Method
Hot Deck Imputation
• Obtain subgroups by using CHAID
• Assign nonrespondents to the groups
• Decide exact value to be imputed using a regression model
(multiple imputation)
• For background variables / background variables and
central questions
Multiple Imputation
• Use AMELIA (King et al., 1998) to calculate 10 discrete imputation
values for each variable
• Calculate the mean distribution by summing the 10 proportions of
each of the categories of the variable and dividing by 10
• Compute variance to take both within- and between-imputation
variance into account
• For background variables / background variables and central
questions
Dependent variables
• Voted at national elections
• Voted at provincial elections
• Self-reported political interest
• Importance of voting
Results for weighting
                    52%               78%
                    Rsp   BG    CQ    Rsp   BG    CQ    TS
Voted national      85.5  83.3  74.5  78.0  77.5  74.5  74.5
Political Interest  78.8  78.0  65.2  73.0  72.1  65.2  65.2
Voted provincial    47.4  46.1  40.6  42.3  42.0  40.1  39.5
Importance Voting   69.5  68.7  63.4  59.7  59.5  56.5  -
Rsp: Respondents, BG: Background variables
CQ: Central Questions, TS: Total sample
Compare different methods
Response level 78%
                    Rsp   W     HDI   MI    EX    W     HDI   MI    TS
                          BG    BG    BG          CQ    CQ    CQ
Voted National      78.0  77.5  77.8  75.7  73.0  74.5  75.4  75.1  74.5
Political Interest  73.0  72.1  72.2  71.7  69.2  65.2  65.6  64.6  65.2
Voted Provincial    42.3  42.0  42.8  42.5  39.0  40.1  41.1  41.2  39.5
Importance Voting   59.7  59.5  59.7  57.8  52.7  56.5  56.2  56.1  -
Rsp: Respondents, W: Weighting, HDI: Hot Deck Imputation, MI: Multiple Imputation. EX:
Extrapolation, BG: Background variables, CQ: Central Questions, TS: Total sample
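One way to read this comparison is to compute, for each method, the mean absolute deviation from the total-sample (TS) value. The sketch below assumes the column assignment obtained by reading the flattened table row by row (Rsp, W/HDI/MI with BG, EX, W/HDI/MI with CQ, TS) and leaves out Importance Voting, for which no TS value is given.

```python
total_sample = {"national": 74.5, "interest": 65.2, "provincial": 39.5}

# Estimates at the 78% response level (assumed column assignment).
estimates = {
    "W-BG":   {"national": 77.5, "interest": 72.1, "provincial": 42.0},
    "HDI-BG": {"national": 77.8, "interest": 72.2, "provincial": 42.8},
    "MI-BG":  {"national": 75.7, "interest": 71.7, "provincial": 42.5},
    "EX":     {"national": 73.0, "interest": 69.2, "provincial": 39.0},
    "W-CQ":   {"national": 74.5, "interest": 65.2, "provincial": 40.1},
    "HDI-CQ": {"national": 75.4, "interest": 65.6, "provincial": 41.1},
    "MI-CQ":  {"national": 75.1, "interest": 64.6, "provincial": 41.2},
}

def mean_abs_deviation(estimate):
    """Mean absolute difference between a method's estimates and the TS values."""
    return sum(abs(estimate[k] - total_sample[k]) for k in total_sample) / len(total_sample)

deviation = {method: mean_abs_deviation(est) for method, est in estimates.items()}
for method, value in sorted(deviation.items(), key=lambda item: item[1]):
    print(f"{method:7s} {value:.2f}")
```

Voted national and political interest are themselves the central questions, so the CQ methods are expected to match TS closely on those two rows; the provincial vote is the more informative comparison.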
Relations: regression turnout provincial elections
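[Table: predictors marked * in the regression, by correction method.
Columns: response level 52 (Resp, W, W, HDI, MI, under the headers BV and CQ),
response level 78 (Resp, W, W, HDI, MI, under the headers BV and CQ), and TS.
Rows: VtNat, Age, Urb, Sex, Educ, Ethn, Value, Mobil, Cohe.
In the extracted text, * markers follow the row labels VtNat, Age, Urb, Educ and Mobil;
their placement in individual columns is not recoverable.]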
Conclusions
• Using central questions leads to better estimates than using only
background variables
• Higher response levels lead to better estimates
• All correction techniques perform equally well: the information
used in the correction is more important than the technique used
• Correcting bias in regression parameters is less successful
Recommendations
• Always reapproach nonrespondents, to try to reach a response level
of 75 %
• Always ask (a sample of) nonrespondents to answer a small number
of central questions
• Always try to get as much information as possible from external
sources
• The technique used is not so important – simple techniques perform
as well as more complex ones.
Thank you for your attention!
• Questions?
• Contact: robertvoogt@gmail.com