A Theoretical Framework for Adaptive Collection Designs

Jean-François Beaumont, Statistics Canada
David Haziza, Université de Montréal
International Total Survey Error Workshop
Québec, June 19-22, 2011
Overview
▪ Selected literature review
▪ Framework
• Definition of the problem
• Choice of quality indicator and cost function
• Mathematical formulation of the problem
▪ Solution and discussion
▪ Conclusion
Literature review:
Groves & Heeringa (2006, JRSS, Series A)
▪ Responsive designs: Use paradata to guide changes in the features of data collection in order to achieve higher-quality estimates per unit cost
• Paradata: Data about the data collection process
• Examples of features: mode of data collection, use of incentives, …
• Need to define quality and determine quality indicators
• Two main concepts: phase and phase capacity
Literature review:
Groves & Heeringa (2006, JRSS, Series A)
▪ Phase: Period of data collection during which the same set of methods is used
• Phase 1: gather information about design features
• Phases 2+: alter features (e.g., subsampling of nonrespondents, larger incentives, …)
▪ A phase is continued until its phase capacity is reached
• Judged by the stability of an indicator as the phase matures
Literature review:
Schouten, Cobben & Bethlehem (2009, SM)
▪ Goal: determine an indicator of nonresponse bias as an alternative to response rates
▪ Proposed a quality indicator, called the R-indicator:
$$R(\rho) = 1 - 2 \times \text{Pop.Std.Dev.}(\rho_i,\ i \in U), \quad 0 \le R(\rho) \le 1$$
• Population standard deviation must be estimated
• Response probabilities, $\rho_i$, must be estimated using some model
▪ An issue: indicator depends on the proper choice of model (choice of auxiliary variables)
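The R-indicator definition above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the response probabilities are taken as already estimated, no design weights are used, and the standard deviation uses an N - 1 divisor. The function name `r_indicator` is hypothetical.

```python
import math

def r_indicator(rho):
    """Return 1 - 2 * (standard deviation of the response probabilities).

    Assumptions: `rho` holds estimated response probabilities, no design
    weights are applied, and the variance uses an N - 1 divisor.
    """
    n = len(rho)
    mean = sum(rho) / n
    var = sum((r - mean) ** 2 for r in rho) / (n - 1)
    return 1.0 - 2.0 * math.sqrt(var)

# Equal response probabilities give the most balanced response, R = 1.
balanced = r_indicator([0.5, 0.5, 0.5, 0.5])
# Unequal probabilities lower the indicator.
unbalanced = r_indicator([0.2, 0.8])
```

Because the indicator is computed from modelled probabilities, a different choice of auxiliary variables in the response model would change its value, which is the issue raised on the slide.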
Literature review:
Schouten, Cobben & Bethlehem (2009, SM)
▪ Another issue: indicator does not depend on the variables of interest but nonresponse bias does
▪ Maximal bias of $\hat{\theta}_{NA}$:
$$\frac{(1 - R(\rho))\, S(y)}{2 \bar{\rho}}$$
▪ $\hat{\theta}_{NA}$ is the unadjusted estimator of the population mean:
$$\hat{\theta}_{NA} = \frac{\sum_{i \in s_r} w_i y_i}{\sum_{i \in s_r} w_i}$$
▪ Two limitations of maximal bias (and R-indicator):
• unadjusted estimator is rarely used in practice
• depends on proper specification of $\rho_i$
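As a rough illustration of the two quantities on this slide, the sketch below computes the unadjusted weighted mean over respondents and the maximal-bias bound (1 - R) S(y) / (2 x mean response probability). The function names and input values are hypothetical, and the bound takes R, S(y) and the mean response probability as given rather than estimating them.

```python
def unadjusted_mean(weights, values):
    """Weighted mean of y over respondents, with no nonresponse adjustment."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)

def maximal_bias(r_value, s_y, rho_bar):
    """Upper bound on the absolute bias of the unadjusted mean.

    r_value: R-indicator, s_y: standard deviation of y,
    rho_bar: mean response probability (all assumed known here).
    """
    return (1.0 - r_value) * s_y / (2.0 * rho_bar)

mean_na = unadjusted_mean([1.0, 3.0], [2.0, 4.0])
bound = maximal_bias(r_value=0.8, s_y=10.0, rho_bar=0.5)
```

The bound shrinks as R approaches 1, but it still depends on S(y), which is how the variable of interest enters the picture.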
Literature review:
Peytchev, Riley, Rosen, Murphy & Lindblad (2010, SRM)
▪ Goal: Reduce nonresponse bias through case prioritization
▪ Suggest targeting individuals with lower estimated response probabilities
• For instance, give them larger incentives or give interviewer incentives
• Their approach is basically equivalent to trying to increase the R-indicator (or achieving a more balanced sample)
▪ Recommend using auxiliary variables that are associated with the variables of interest
Literature review:
Laflamme & Karaganis (2010, ECQ)
▪ Development and implementation of responsive designs for CATI surveys at Statistics Canada
▪ Planning phase:
• before data collection starts (determination of strategies, analyses of previous data, …)
▪ Initial collection phase:
• evaluate different indicators to determine when the next phase should start
▪ Two responsive design (RD) phases
Literature review:
Laflamme & Karaganis (2010, ECQ)
▪ RD phase 1:
• prioritize cases (based on paradata or other information) with the objective of improving response rates
• increase the number of respondents (desirable)
▪ RD phase 2:
• prioritize cases with the objective of reducing the variability of response rates between domains of interest (increasing the R-indicator)
• likely to reduce the variability of weight adjustments (desirable)
Literature review:
Schouten, Calinescu & Luiten (2011, Stat. Netherlands)
▪ First paper to propose a theoretical framework for adaptive survey designs
▪ Suggest:
• Maximizing quality for a given cost; or
• Minimizing cost for a given quality
▪ Requires a quality indicator (e.g., overall response rate, R-indicator, maximal bias, …)
• Which one to use?
Definition of the problem
▪ Adaptive collection design: Any procedure of call prioritization or resource allocation that is dynamic as data collection progresses
• Use paradata (or other information) to adapt itself to what is observed during data collection
• Focus on call prioritization
▪ Our objective: Maximize quality for a given cost
▪ Context: CATI surveys
Choice of quality indicator
▪ Focus of the literature: Find collection designs that reduce nonresponse bias (or maximize the R-indicator) of an unadjusted estimator
▪ We think the focus should not be on nonresponse bias. Why?
• Any bias that can be removed at the collection stage can also be removed at the estimation stage
▪ We suggest reducing the nonresponse variance of an estimator adjusted for nonresponse
Quality indicator
▪ Suppose we want to estimate the total: $\theta = \sum_{i \in U} y_i$
▪ Assuming that nonresponse is uniform within cells, an asymptotically unbiased estimator is:
$$\hat{\theta}_A = \sum_{g=1}^{G} \sum_{i \in s_{rg}} \frac{w_{gi}}{\hat{\rho}_g}\, y_{gi}, \quad \text{with } \hat{\rho}_g = \frac{n_{rg}}{n_g}$$
▪ Quality indicator: The nonresponse variance
$$\mathrm{var}_q(\hat{\theta}_A \mid s) \approx \sum_{g=1}^{G} (\rho_g^{-1} - 1)(n_g - 1)\, S_{wy,g}^2$$
$$\rho_g = E_q(\hat{\rho}_g \mid s) = E_q(n_{rg} \mid s) / n_g$$
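A small sketch of the adjusted estimator and of the nonresponse variance used as the quality indicator may help fix the notation. The data layout (a list of cells, each with its sample size and the respondents' weight/value pairs) and the function names are assumptions made for illustration. Each cell is assumed to have at least one respondent, and the within-cell variances S²(wy, g) are taken as given.

```python
def adjusted_total(cells):
    """Weighting-class estimator of the total.

    Each cell is a dict with "n" (sample size n_g) and "respondents",
    a list of (w_gi, y_gi) pairs. The cell response rate n_rg / n_g
    inflates the respondent weights.
    """
    total = 0.0
    for cell in cells:
        rho_hat = len(cell["respondents"]) / cell["n"]
        total += sum(w * y for w, y in cell["respondents"]) / rho_hat
    return total

def nonresponse_variance(rho, n, s2_wy):
    """Approximate nonresponse variance: sum of (1/rho_g - 1)(n_g - 1) S2_wy_g."""
    return sum((1.0 / r - 1.0) * (ng - 1) * s2
               for r, ng, s2 in zip(rho, n, s2_wy))

cells = [{"n": 4, "respondents": [(2.0, 1.0), (2.0, 3.0)]}]
theta_hat = adjusted_total(cells)
variance = nonresponse_variance(rho=[0.5], n=[4], s2_wy=[2.0])
```

The variance falls as the cell response probabilities rise, and cells with a large S²(wy, g) contribute most, which is what drives the allocation discussed later in the talk.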
Overall cost
▪ Overall cost: $C_{TOT} = \sum_{g=1}^{G} C_{TOT,g}$
$$C_{TOT,g} = \sum_{i \in s_{rg}} \left[ (m_{gi} - 1)\, C_{NR,g} + C_{R,g} \right] + \sum_{i \in s_g - s_{rg}} m_{gi}\, C_{NR,g}$$
$m_{gi}$: total number of attempts for unit $i$
$C_{NR,g}$: cost of an unsuccessful attempt
$C_{R,g}$: cost of an interview
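The cell-level cost can be sketched as follows, assuming the number of attempts per unit is known after collection. A respondent's last attempt is the interview and every other attempt is unsuccessful. The function and argument names are hypothetical.

```python
def cell_cost(attempts_resp, attempts_nonresp, c_nr, c_r):
    """Realized cost for one cell.

    attempts_resp: attempts m_gi for each respondent (last one is the interview)
    attempts_nonresp: attempts m_gi for each nonrespondent (all unsuccessful)
    c_nr: cost of an unsuccessful attempt, c_r: cost of an interview
    """
    cost_resp = sum((m - 1) * c_nr + c_r for m in attempts_resp)
    cost_nonresp = sum(m * c_nr for m in attempts_nonresp)
    return cost_resp + cost_nonresp

# Two respondents (reached on attempts 1 and 3) and one nonrespondent (2 attempts).
example_cost = cell_cost([1, 3], [2], c_nr=1.0, c_r=10.0)
```

Summing `cell_cost` over the cells gives the overall cost.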
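Expected overall cost
▪ Expected overall cost:
$$\bar{C}_{TOT} = E_q(C_{TOT} \mid s) = \sum_{g=1}^{G} \bar{C}_{TOT,g}$$
$$\bar{C}_{TOT,g} = (C_{R,g} - C_{NR,g})\, n_g \rho_g + C_{NR,g} \sum_{i \in s_g} \bar{m}_{gi}$$
$$\bar{m}_{gi} = E_q(m_{gi} \mid s) = m(p_{gi}, M_{gi})$$
Assumption: $\bar{m}_{gi}$ does not depend on $\rho_g$
$$\bar{C}_{TOT} = \gamma_0 + \sum_{g=1}^{G} \gamma_{1g}\, n_g \rho_g$$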
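The step from the cell-level expected cost to the linear form can be written out as below. The symbols $\gamma_0$ and $\gamma_{1g}$ are assumed names for the intercept and the per-cell slope. The slide does not spell the coefficients out, so this is one reading that is consistent with the preceding formulas: under the stated assumption, the term involving $\bar{m}_{gi}$ is constant in $\rho_g$ and can be collected into the intercept.

```latex
\begin{aligned}
\bar{C}_{TOT}
  &= \sum_{g=1}^{G} \Big[ (C_{R,g} - C_{NR,g})\, n_g \rho_g
     + C_{NR,g} \sum_{i \in s_g} \bar{m}_{gi} \Big] \\
  &= \underbrace{\sum_{g=1}^{G} C_{NR,g} \sum_{i \in s_g} \bar{m}_{gi}}_{\gamma_0}
     + \sum_{g=1}^{G} \underbrace{(C_{R,g} - C_{NR,g})}_{\gamma_{1g}}\, n_g \rho_g
\end{aligned}
```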
Mathematical formulation
▪ Objective: Find $\rho_g$, $g = 1, \ldots, G$, that minimize the nonresponse variance $\mathrm{var}_q(\hat{\theta}_A \mid s)$ subject to a fixed expected overall cost, $\bar{C}_{TOT} = K$
▪ Solution ($\lambda$ is the constant determined by the cost constraint):
$$\rho_g = \left[ \frac{(1 - n_g^{-1})\, S_{wy,g}^2}{\lambda\, \gamma_{1g}} \right]^{1/2}$$
▪ Note: Equivalent to maximizing the R-indicator only in a very special scenario
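The constrained minimization can be sketched numerically. Under one reading of the slides, the variance is a sum of (1/rho_g - 1)(n_g - 1) S²(wy, g) over cells and the expected cost is linear in n_g rho_g. Setting the derivative of the Lagrangian to zero gives rho_g proportional to the square root of (1 - 1/n_g) S²(wy, g) / gamma_1g, and the multiplier then follows in closed form from the cost constraint. The names `optimal_targets`, `gamma1` and `gamma0` are hypothetical. The sketch assumes positive slopes gamma_1g and does not truncate targets that exceed 1.

```python
import math

def optimal_targets(n, s2_wy, gamma1, budget, gamma0=0.0):
    """Target response probabilities minimizing the nonresponse variance
    subject to gamma0 + sum(gamma1_g * n_g * rho_g) = budget.

    Assumptions: gamma1_g > 0 for every cell and budget > gamma0.
    The returned values are not clipped to the interval [0, 1].
    """
    a = [(ng - 1) * s2 for ng, s2 in zip(n, s2_wy)]  # (n_g - 1) * S2_wy_g
    # The cost constraint fixes the square root of the Lagrange multiplier.
    sqrt_lambda = sum(math.sqrt(g1 * ng * ag)
                      for g1, ng, ag in zip(gamma1, n, a)) / (budget - gamma0)
    return [math.sqrt(ag / (g1 * ng)) / sqrt_lambda
            for ag, g1, ng in zip(a, gamma1, n)]

n = [100, 100]
gamma1 = [1.0, 1.0]
targets = optimal_targets(n, s2_wy=[1.0, 4.0], gamma1=gamma1, budget=90.0)
spent = sum(g1 * ng * r for g1, ng, r in zip(gamma1, n, targets))
```

In this sketch the targets are equal across cells only when (1 - 1/n_g) S²(wy, g) / gamma_1g is the same in every cell. Equal targets are what a perfectly balanced response would require, so this is one way to read the remark about the R-indicator.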
Implementation
▪ Find the effort $e_{gi}$ (number of attempts) necessary to achieve the target response probability $\rho_g$:
$$e_{gi} = \frac{\ln(1 - \rho_g)}{\ln(1 - p_{gi})}$$
▪ Procedure: Select cases to be interviewed with probability proportional to the effort $e_{gi}$
▪ Issues:
1) Avoid small estimated $p_{gi}$ to avoid an unduly large effort $e_{gi}$
2) Might want to ensure that a certain time has elapsed between two consecutive calls
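The effort formula and the selection step can be sketched as follows. One way to read the formula is that, if each attempt succeeds independently with probability p, the chance of a response within e attempts is 1 - (1 - p)^e, and solving for e gives the expression on the slide. That interpretation, the floor `p_floor` used to guard against very small estimated probabilities, and the function names are all assumptions. Selection uses the standard library's `random.choices`, which draws with replacement with probability proportional to the given weights.

```python
import math
import random

def effort(rho_target, p, p_floor=0.05):
    """Attempts needed to reach rho_target given per-attempt probability p.

    p is floored at p_floor (a hypothetical safeguard) so that a very small
    estimate does not produce an unduly large effort.
    Requires 0 <= rho_target < 1 and p < 1.
    """
    p = max(p, p_floor)
    return math.log(1.0 - rho_target) / math.log(1.0 - p)

def select_case(case_ids, efforts, rng=random):
    """Pick the next case to call with probability proportional to its effort."""
    return rng.choices(case_ids, weights=efforts, k=1)[0]

e_easy = effort(0.75, 0.5)
e_hard = effort(0.75, 0.1)
next_case = select_case(["a", "b"], [0.0, e_hard])
```

A rule on the minimum time between two calls to the same case, the second issue on the slide, is not handled in this sketch.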
Graph of variance vs cost
[Figure: minimum nonresponse variance plotted against expected overall cost]
Revised solution
▪ Solution of the optimization problem is found before data collection starts
▪ May be a good idea to revise the solution periodically (e.g., daily)
• Some parameters might need to be modified
• Update remaining budget and expected overall cost
• The revised optimization problem is similar to the initial one
Revised solution
▪ Solution (same as before):
$$\rho_g = \left[ \frac{(1 - n_g^{-1})\, S_{wy,g}^2}{\lambda\, \gamma_{1g}} \right]^{1/2}$$
▪ Revised target response probability (could be negative):
$$\tilde{\rho}_g = \frac{n_g \rho_g - n_{rg}}{n_g - n_{rg}}$$
▪ Effort:
$$e_{gi} = \frac{\ln(1 - \tilde{\rho}_g)}{\ln(1 - p_{gi})}$$
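The revised target can be read as the share of the remaining cases in a cell that still needs to respond for the cell to reach n_g rho_g respondents in total. The sketch below computes it. Because the slide notes that the value can be negative (the cell has already passed its target), the sketch truncates it at zero, which is an assumption rather than something stated in the talk. The function name is hypothetical.

```python
def revised_target(n_g, rho_g, n_rg):
    """Target response probability among the cases still open in a cell.

    n_g: sample size, rho_g: target response probability for the cell,
    n_rg: respondents obtained so far. Negative values are truncated to 0
    (an assumed convention: no further effort is needed in the cell).
    """
    remaining = n_g - n_rg
    if remaining <= 0:
        return 0.0
    return max(0.0, (n_g * rho_g - n_rg) / remaining)

still_needed = revised_target(n_g=100, rho_g=0.6, n_rg=20)
already_met = revised_target(n_g=100, rho_g=0.3, n_rg=40)
```

The revised value then replaces the original target in the effort formula when the solution is updated during collection.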
Conclusion
▪ Next steps:
• Simulation study
• Adapt the theory for practical applications
• Test in a real production environment
▪ Which quality indicator? Nonresponse variance? Others?
▪ Reduction of nonresponse bias: subsampling of nonrespondents
• Our approach could be used within the subsample
Thanks - Merci
▪ For more information, please contact:
Jean-François Beaumont (Jean-Francois.Beaumont@statcan.gc.ca)
David Haziza (David.Haziza@umontreal.ca)