An ounce of planning is worth a pound of weighting: Measuring cash holdings
from the 2013 Bank of Canada Method-of-Payments Survey
Heng Chen, Bank of Canada
Chris Henry, Bank of Canada
Kim P. Huynh, Bank of Canada
Q. Rallye Shen, Bank of Canada
Kyle Vincent, Bank of Canada
Abstract
This article details the methodology used in the Bank of Canada's 2013 Method-of-Payments Survey to measure cash and non-cash payments. Measuring cash holdings is difficult due to the anonymous nature of cash and the fact that some populations are hard to reach. To ensure that the survey yields a representative sample, a variety of methods in survey design, weighting, and variance estimation are used to estimate cash holdings in the Canadian population. Overall, we find that Canadians hold on average about 84 dollars in cash.
Keywords
Sample weighting, non-probability samples, calibration, raking, variance estimation.
Acknowledgements: We thank Ben Fung, Geoff Gerdes, Catherine Haggerty, Arthur
Kennickell, Marcos Sanches, and numerous colleagues at the Bank of Canada for their
useful comments and encouragement in undertaking this survey. Maren Hansen
provided excellent editorial assistance. We acknowledge the tremendous collaboration
and support from Shelley Edwards, Jessica Wu, and Ipsos Reid for their dedication to
this study. Finally, we thank Statistics Canada for providing access to the 2011 National
Household Survey and the 2012 Canadian Internet Use Survey. The views expressed in this paper are those of the authors and do not represent the views of the Bank of Canada.
Section 1: Introduction
The Bank of Canada has an interest in understanding the level of cash holdings as it is the sole issuer of Canadian bank notes. Measuring the amount of cash holdings is difficult, however, as cash is an anonymous and untraceable payment method. Therefore, in 2009 the Bank of Canada undertook a Method-of-Payments (MOP) Survey, which found that demographic factors are strongly correlated with cash usage. Arango, Huynh, and Sabetti (2011), in their analysis of these factors, find that an increase in the availability and usage of non-cash payment methods such as debit, credit, and even mobile payments makes holding cash seem quaint among younger, educated, and wealthier demographics.
The 2009 MOP introduced a three-day payment diary, which served as a memory aid to
record all payments, including those in cash. This payment diary methodology has been
successfully used in six other countries: Austria, Australia, France, Germany, the
Netherlands, and the United States; more details are available in Bagnall et al. (2014).
In 2013, the Bank of Canada conducted a second MOP Survey to measure cash usage in Canada. The 2013 MOP is a mixed-mode survey involving both
paper and online collection methods. An Online sample was selected from a market
access panel accessible by email and an Offline sample from a panel accessible by
regular mail. In addition, a subsample was taken from another comprehensive annual
household survey, the Canadian Financial Monitor (CFM), which collects data on
household finances from approximately 12,000 Canadian households per annum. The
CFM survey instrument collects information that complements and overlaps with the
MOP, providing a complete picture of household finances. In total, 3600 surveys were
collected for the 2013 MOP from across the country and then weighted to ensure that
the sample is representative of the Canadian population.
Surveys of the aforementioned type require post-stratification (sample weighting) as
they are likely to be based on non-probability or, at best, highly unequal probability
samples. For example, in the 2009 MOP, males aged 18-24 were the most difficult to recruit and had the lowest response rates. Hence, as there is a large degree of
heterogeneity in the recruitment procedure, this type of study benefits from sample
calibration. The objective of this paper is to describe the methods undertaken to ensure
that the 2013 MOP is a representative sample.
The chief methods used consist of:
1) Revamping and redesigning the survey to ensure that it is user-friendly and easy
to fill out. Users report a satisfaction rating of about 90 percent on the survey.
2) Engaging respondents using a mixture of financial incentives and appeals to civic
duty, via an official letter from the Governor of the Bank of Canada.
3) Subsampling the CFM, whose data collection is conducted on a rolling monthly
basis. This provides our own survey the advantage of periodic updates.
4) Collaborating with the survey company to ensure that the field work achieves
preset population targets. The survey company hit all the targets with the help of
an additional boost wave, in which further invitations were sent out offering higher incentives for response.
5) Cleaning and editing the data using external data sources such as the market
access panel demographic profile, and verifying the subsample with the CFM. As
a result, the level of missing data is quite low, at only about 1-3 percent, and so
only light imputation is required.
6) Using raking methodology to conduct post-stratification weighting, based on external administrative and large-scale survey data from Statistics Canada. The resulting weights are insensitive to the choice of initial weights.
7) Using resampling methods with bootstrap replicate survey weights (BRSW) for
the variance estimation. Usage of BRSW results in a decrease of 20-50 percent in
the variance of the estimate.
The rest of this paper describes these procedures in detail. Section 2 discusses points 1-5, Section 3 discusses the post-stratification methods, and Section 4 highlights the resampling methods for variance estimation. Finally, Section 5 concludes.
Section 2: The 2013 Method-of-Payments Survey
The 2013 Method-of-Payments Survey is an update to the 2009 MOP. For further
information on the 2009 MOP survey, refer to Arango and Welte (2012); for the 2013
MOP, see Henry, Huynh, and Shen (2014). Indeed, planning for the survey included
incorporating lessons learned from both the 2009 MOP and the CFM.
One important lesson from the 2009 MOP is that certain hard-to-reach populations were
under-represented in the final sample. This led to zero/low cell counts, which in turn
gave rise to extreme weights. While sampling targets were achieved for marginal
demographic counts, missing cell counts at a nested level caused difficulties for the
weighting process.
Several measures were implemented to ensure that the sampling procedure in the 2013
MOP would avoid this problem. First, established sampling targets were nested by
region, age, and gender. Thus we knew beforehand, for example, how many males aged
18-24 from the Prairies region of Canada were required for the sample to reflect the
Canadian population. These pre-defined targets, built into the statement of services for
the survey company, facilitated the ongoing monitoring of returns during data
collection. Frequent updates allowed us to project which cells were likely to have an
excess or shortage. Finally, various levels and types of incentives were randomly
offered to potential respondents, which allowed us to determine the most effective
combination. Financial incentives ranged from $5 to $20; other types of incentives
included an advance letter signed by the Governor of the Bank of Canada requesting
participation, and a token $2 coin included in the survey package regardless of whether
or not the respondent participated.
Collaboration with the survey company, Ipsos Reid, was important to ensure that these
tools were effectively employed to hit the nested sampling targets. Ipsos Reid provided
almost daily updates to establish up-to-date projections for the final returns. During
data collection, certain cells were identified as in danger of being under-represented in
the final sample. Through timely collaboration with Ipsos Reid and senior management,
an additional sampling wave was added for the offline recruitment, which utilized the
full spectrum of high-level incentives: the $20 completion incentive, advance letter,
and token $2 incentive. As a result, we were able to hit (and exceed) all nested targets,
and ensure that no zero cells would impede the calibration.
The other main innovation of the 2013 MOP was to leverage the existing CFM survey, via
the survey instrument and the method of data collection. By comparing the 2009 MOP
with the CFM we discovered that the surveys contain some overlapping content. This led
to a sub-sampling of past CFM respondents (with the sampling frame provided by Ipsos
Reid) in addition to maintaining the Online and Offline panels employed for the 2009
MOP. For certain aspects of the questionnaire, for example information on a respondent's bank account(s) and credit card(s), the respondent had already provided the desired information in the CFM; this allowed us to shorten the 2013 MOP questionnaire, which likely had a positive impact on response rates.
Other topics in the 2009 survey, such as cash usage, were maintained in the 2013 MOP
but made directly comparable to questions in the CFM. This sub-sampling approach
proved very successful in recruiting respondents, with a response rate of over 50
percent. Furthermore, it provides an external benchmark with which to compare
measures of consumer cash holdings/usage.
Section 3: Survey Weighting
Data collection methods for large-scale surveys are almost always biased. Two common explanations are that the sampler does not have access to a full sampling frame, and that the sampler does not have full control over the sample selection procedure. However, sample calibration (balancing) is a post-stratification procedure that can be used to re-weight sample data so that it is more representative of the target population (Sarndal, 2007). Such sample weighting can facilitate more accurate estimates of population unknowns.
The calibration procedure leverages the availability of national-level counts of
auxiliary/demographic information, so as to balance the sample to these counts. For
example, in our calibration analysis we make use of national-level counts based on the
2011 National Household Survey (NHS) and the 2012 Canadian Internet Use Survey
(CIUS) for a variety of demographic variables. Calibration over such information offers
users the ability to reduce non-response and non-coverage bias effects. Consequently, the resulting estimators should be less biased than those based on the unweighted data (Kish, 1992).
In our calibration analysis we follow a series of steps to arrive at a suitable set of
calibration weights. We summarize the process with the flowchart in Figure 1 and
provide a detailed breakdown below. We then conclude the section with a reflection on
non-response weights.
Subsection 3.1 Outline of calibration analysis.
Stage A: We first consider a set of potential calibration variables that include both demographic and technology-oriented variables. These are chosen based on their conjectured relationship with important survey questions. A round of data editing and cleaning is undertaken, and imputation of missing values in the calibration variables is achieved with the aid of the R package mice (van Buuren, 2012).
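As a minimal sketch of this imputation step, assuming a data frame mop_data whose categorical calibration variables are illustrative placeholders (not the actual MOP fields):

    library(mice)
    # Impute missing values in categorical calibration variables;
    # "polyreg" uses polytomous regression, suitable for unordered factors
    vars <- c("region", "age_group", "gender", "income_band")
    imp <- mice(mop_data[, vars], m = 5, method = "polyreg", seed = 2013)
    mop_complete <- complete(imp, 1)  # extract one completed data set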
As recruitment came from three separate market access panels, a comparative analysis
is performed using the Epps-Singleton test for homogeneity on the demographic
variables. The three subsamples are found to be fairly homogeneous in composition and
we therefore concatenate them into one final sample for the calibration analysis.
A correlation analysis over the calibration variables is conducted, in order to determine
any potential collinear variables to eliminate from the analysis. As the calibration
variables are classified as categorical or ordinal variables, we use the polychoric
correlation measure (Drasgow, 2004) with the aid of the R package “polycor” (Fox,
2010) to compute the correlations.
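A brief sketch of this correlation check, reusing the hypothetical variables from the imputation example above:

    library(polycor)
    # Pairwise polychoric correlation between two categorical/ordinal variables
    rho <- polychor(mop_complete$age_group, mop_complete$income_band)
    # Full mixed correlation matrix: hetcor() chooses polychoric, polyserial,
    # or Pearson correlations as appropriate for each pair of variables
    cor_mat <- hetcor(mop_complete[, c("region", "age_group", "gender", "income_band")])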
Stage B: The calibration variables are found to be only mildly to moderately correlated,
and hence no variables are eliminated from the analysis. With respect to nesting calibration variables, we pair the two most correlated variables with each other, as well as the gender variable with several other calibration variables. Nesting allows us to avoid sparse cells. For example, the gender variable is often paired with other calibration variables because, as a binary variable, it avoids small cell counts while still accounting for disagreements between the joint sample and national distributions.
A range of calibration techniques can be used; see Deville et al. (1993) for a
mathematically detailed account of some commonly used procedures. We select the raking and generalized regression (GREG) procedures, as these are popular methods among both national statistical agencies and academics; see Sarndal (2007) and the R survey package by Lumley (2010).
Two sets of initial weights for the raking method are considered: one based on a simple random sampling design and one based on a stratified sampling design over several key demographics. The correlation between the two sets of generated weights, when using the raking algorithm and the full list of calibration variables, is high. Hence, we conclude that the final weights are likely to be insensitive to the choice of initial weights, and therefore base them on the simple random sampling design.
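A sketch of how such a raking step can be set up with the R survey package (Lumley, 2010) follows; the margins and population counts are illustrative rather than actual NHS totals, and the combined gender-by-age variable mirrors the nesting described above:

    library(survey)
    # Nest gender within age to control their joint distribution
    mop_complete$gender_age <- interaction(mop_complete$gender, mop_complete$age_group)
    # Initial design: simple random sampling with equal base weights
    des <- svydesign(ids = ~1, weights = rep(1, nrow(mop_complete)), data = mop_complete)
    # Marginal population counts (illustrative numbers only)
    pop_region <- data.frame(region = c("Atlantic", "BC", "Ontario", "Prairies", "Quebec"),
                             Freq = c(1.9e6, 3.6e6, 10.5e6, 4.8e6, 6.3e6))
    pop_gender_age <- data.frame(gender_age = levels(mop_complete$gender_age),
                                 Freq = c(3.1e6, 3.0e6, 5.4e6, 5.5e6, 5.0e6, 5.1e6))
    raked <- rake(des, sample.margins = list(~region, ~gender_age),
                  population.margins = list(pop_region, pop_gender_age))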
Stage C: The raking and GREG procedures are evaluated based on the range of weights they generate. With numerous combinations of calibration variables, the GREG procedure gives rise to a number of negative weights. Since raking avoids this problem and is a post-stratification method widely used in the statistics profession, we choose the raking method.
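As an illustration of this diagnostic, the weight ranges of the two procedures can be compared directly. The GREG fit below reuses the design and illustrative totals from the raking sketch; note that calibrate() expects population totals named after the columns of the model matrix implied by the formula:

    # Raking keeps the calibrated weights strictly positive
    summary(weights(raked))
    # Linear (GREG) calibration on comparable margins can yield negative weights
    pop_totals <- c(`(Intercept)` = 27.1e6,
                    regionBC = 3.6e6, regionOntario = 10.5e6,
                    regionPrairies = 4.8e6, regionQuebec = 6.3e6,
                    genderM = 13.3e6)
    greg <- calibrate(des, formula = ~region + gender,
                      population = pop_totals, calfun = "linear")
    summary(weights(greg))  # inspect the minimum for negative values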
Subsection 3.2 Non-response weights.
The issue of non-response is a common concern among survey practitioners; see Kish (1992) for more details. Typically, non-response will bias the estimates, inflate their variance, and increase survey costs (as follow-ups can be expensive). However, several procedures can be used to account for such non-random non-response.
In some cases, typically when the quantity of non-response is small, imputation can be
prescribed to resolve such issues. However, determining a suitable model and
imputation strategy can be resource-intensive and computationally expensive, as surveys comprise many questions. Instead, the original calibration weights can
be adjusted to compensate for the non-response issues.
Further, when responses are missing completely at random (MCAR) (Rubin, 1976) and
non-response counts are small, one approach is to base estimation on a rescaling of the
original calibration weights for those who have responded. However, the non-response
pattern will usually be such that it cannot be viewed as MCAR. A common approach in
such a case is to assume that the responses are missing at random (MAR) (Rubin, 1976).
In other words, the responses are MCAR within strata/classes of the survey respondents. This assumption allows the response probabilities to be estimated, possibly with a logistic regression or propensity scoring model, and appended to the original calibration weights. Essentially, a respondent i with inclusion probability P_i receives an original calibration weight 1/P_i and an estimated response probability r_h depending on their stratum membership h. Their corresponding weight, for the survey question under consideration, is then w_i = 1/(P_i r_h).
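A minimal sketch of this adjustment, assuming a hypothetical frame-level data frame panel_frame containing a response indicator, an inclusion probability incl_prob, and the illustrative covariates used above:

    # Estimate response propensities under the MAR assumption
    resp_model <- glm(responded ~ region + age_group + gender,
                      family = binomial, data = panel_frame)
    panel_frame$r_hat <- predict(resp_model, type = "response")
    # Combined weight for respondents: w_i = 1 / (P_i * r_h), with the estimated
    # propensity standing in for the stratum-level response probability r_h
    respondents <- subset(panel_frame, responded == 1)
    respondents$w <- 1 / (respondents$incl_prob * respondents$r_hat)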
The aforementioned approach comes with precautions. As noted by Sarndal (2007), an inherent bias in the procedure is likely when estimating the actual response probabilities. Hence, a high level of caution should be exercised when positing response models. However, rigorous methods have been developed to approximate
potential non-response bias as a function of the responses and the national-level
covariate information (Sarndal and Lundstrom, 2008); with such methods, suitable
calibration variables can be chosen to reduce the non-response bias.
Subsection 3.3 Non-probability samples
In an empirical setting, non-probability sampling is a common concern, as latent factors may influence selection probabilities (consider the discussion above on non-response).
When estimation is based on non-probability samples, the population attributes are
typically assumed to be distributed somewhat evenly so that the sample weighting
(through a posited probability sampling design and sample calibration) can still provide
accurate results.
Web-based studies are gaining popularity due to their convenience and efficiency in
obtaining data. Almost all of these surveys rely on convenience or volunteer sampling,
both of which are non-probability sampling designs. Apart from legally mandated surveys such as the national census, most surveys rely on volunteer sampling. Such
limitations have been acknowledged in the literature, and the demand is increasing for
calibration methods to improve the accuracy of results based on such samples; see
Schonlau et al. (2007) and DiSogra et al. (2011).
Internet and mobile usage information from nationwide surveys presents much
potential for use in calibration analyses. We therefore exploit such demographic information from the 2012 CIUS for our calibration analysis, namely ownership of a mobile device and online payment activity, as some of the MOP survey questions relate to mobile means of payment. More details about this weighting procedure are available in Vincent (2014).
Section 4: Variance Estimation
Since the 2013 MOP employs stratified random sampling (Section 2), and survey
weights are applied to ensure a representative sample (Section 3), variance estimation
should take both the sampling design and calibration procedure into account.
Heuristically speaking, the variance depends on the weighting procedure, not just on the
numerical values of the weights. However, in most payment surveys, variances are
usually calculated by taking the calibrated weights as fixed values, thereby biasing the
variance estimates. In order to capture the randomness from both sampling design and
weight calibration, we propose a resampling method, specifically the bootstrap replicate
survey weights (BRSW).
For example, if the weight calibration is raked over the external variables x, then the
estimated variance of the population-based weighted y variable is:

Var_hat(y) = Σ_{k,l ∈ S} [(π_kl − π_k π_l) / π_kl] [(y_k − x_k b) / π_k] [(y_l − x_l b) / π_l]
where π_k is the k-th unit's inclusion probability in the sample, π_kl is the pairwise inclusion probability of the k-th and l-th units, and the parameter b is the OLS estimate from regressing y on x. The first factor is proportional to the sampling covariance of the k-th and l-th units, while the second and third factors are the fitted residuals scaled by their inclusion probabilities.
Our reasons for choosing resampling over the linearization method are as follows. First,
most software packages use linearization as if the weights were fixed, which does not allow for model-based information (b), but rather uses w_k, the calibrated weight, in the denominator. Although the correct linearization method is suggested by Lu and Gelman (2003), their method can be difficult to implement when sample weights are complicated functions of sample sizes within strata. (Only under simple random sampling does w_kl have a straightforward formula depending on w_k and w_l.) Second, the linearization estimators use the initial weights. Hence, this requires the survey dataset to include both the calibrated weights and the base weights, which may confuse users.
Third, the end user of the data must be given the set of strata variables, which may not
be possible if confidential variables are used in calibration. These complications cause
resampling to be the more popular method for estimating variances when calibrated
weights are used (Shao 1996, Kolenikov 2010).
Among the many resampling methods for a complex survey, we choose to use BRSW.
We prefer the bootstrap to either the jackknife or balanced repeated replication (BRR),
because the jackknife is inconsistent for non-smooth functions (e.g. the median
estimate), and BRR is more suitable for a stratified clustered sampling design, which
was not used in our 2013 MOP Survey. As for choosing to use replicate survey weights,
we do so to protect the privacy of survey respondents (no strata information will be
provided), and because replicate weights incorporate information about corrections for
non-response bias (weights are re-adjusted for each replicate).
The construction of the BRSW involves first re-creating the sample in each replicate and
then adjusting the associated calibrated weights. For example, if a unit from a replicate
is not sampled, a zero weight is assigned to it, and then the weights of the other units in
the same stratum are expanded to compensate. In the next step, the weight calibration is
applied to each of these replicate sets of weights. These two steps generate the
bootstrap replicate weights. We use the bsweights package in Stata to implement this
method (Section 3.2 in Chen and Shen, 2014).
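A rough R analogue of this two-step construction, using the survey package and the illustrative design and margins from Section 3 (cash_on_hand is a hypothetical variable name), might look as follows:

    library(survey)
    # Step 1: draw bootstrap replicates and re-create the sample weights
    rep_des <- as.svrepdesign(des, type = "bootstrap", replicates = 500)
    # Step 2: re-rake every replicate to the same population margins, so each
    # replicate carries its own calibrated weights
    rep_raked <- rake(rep_des, sample.margins = list(~region, ~gender_age),
                      population.margins = list(pop_region, pop_gender_age))
    # Variance of mean cash on hand now reflects both the sampling design and
    # the calibration step
    svymean(~cash_on_hand, rep_raked)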
Table 1 shows mean and variance computations from the 2013 MOP and 2013 CFM data.
As a basis of comparison, we decompose the 2013 MOP Survey sample statistics into
online, offline and CFM subsamples. Overall, the weighted 2013 MOP total mean cash holding of 83.68 is lower than the 2013 CFM mean of 94.67. Part of this finding is driven by the MOP online mean cash holding, which is about 80.06 and constitutes about one-third of the sample. The MOP offline mean of 91.10 is closer to the 2013 CFM mean, partly because this subsample is drawn from the offline access panel. The CFM subsample respondents provide a point of comparison as they are participants of both the 2013 MOP and 2013 CFM. The average cash holdings for this overlapping set of respondents are 83.41 (2013 MOP) and 89.82 (2013 CFM). Overall, the 2013 MOP estimate of cash holdings is lower than the 2013 CFM estimate. Part of this discrepancy may be due to the timing of the surveys: the 2013 MOP was conducted in October-November 2013, while the 2013 CFM respondents were sampled from a subset of January-August 2013 participants.
We compute the variance based on linearization and the BRSW methodology. In the
second row, the variances are calculated by linearization without considering the
weighting procedure, while in the third row the variances are calculated by BRSW. The
BRSW variances are much smaller than those calculated by linearization, because the
resampling method takes into account the weight calibration procedure, which is
applied after the sample is collected.
Note that the paper-based (MOP CFM subsample and MOP offline) BRSW variances improve by about 20-25 percent, while the MOP online BRSW variance improves by over 50 percent. A plausible explanation is two-fold. First, the cash-on-hand variable is more correlated with the raking variables (e.g. online payment) for the online respondents than for the paper-based ones. By computing the R-squared for the different subsamples, we find that the R-squared for the online panel is highest at 0.0179, compared with 0.0048 and 0.0087 for the offline and CFM panels, respectively. Second, since the sample variance also depends on the sample size, the relatively larger sample size of the online subsample could also drive the sizable improvement.
Table 1: Mean and Variance Estimates for Average Cash on Hand

                                  2013 MOP                                    2013 CFM
              Raw      ------------- Weighted --------------       -------- Weighted --------
              Total    Total    Online   Offline  CFM Subsample    CFM Subsample    Total
Mean          87.62    83.68    80.06    91.10    83.41            89.82            94.67
VarLin        15.84    13.27    48.51    50.66    25.81            7.60             4.40
VarBRSW       N/A      7.76     17.90    41.10    20.09            N/A              N/A
R-squared     N/A      0.009    0.018    0.005    0.009            N/A              N/A
Observations  3413     3413     1294     679      1440             1440             12280
Notes: Average cash on hand is measured in Canadian dollars. Statistics are based
on respondents reporting positive figures (i.e. excluding zero responses). VarLin
is the linearized variance and VarBRSW is the variance based on bootstrap
replicate survey weights. R-squared is the goodness of fit from the raking
variables regression.
This exercise demonstrates that if we ignore the random fluctuations due to the weight calibration in the variance estimation, the resulting variances will tend to be conservative (too large) and the confidence intervals too wide, with coverage exceeding the pre-specified nominal level.
Section 5: Conclusion
Our experience from the 2013 Method-of-Payments Survey highlights that survey weighting should not be viewed as a panacea for the ills of a poorly designed survey. Rather, a well-thought-out survey design and careful preparation will go a long way toward ensuring that the survey is representative. Specifically, we suggest that survey teams:
1) Use mixed-mode survey methods so that each mode can be used to validate and
verify the others. In addition, various trusted external data should be brought in
to help calibrate the survey collection methods.
2) Use the methods espoused by Dillman (2007) to induce higher response rates.
3) Work closely with the data collection agency to ensure that the objectives are laid
out in advance. Closely monitor all fieldwork and collaborate to head off any
difficulties.
4) Conduct post-stratification using a variety of methods, but ensure that the
methods are robust and make sense. Again, use a variety of external data to
conduct post-stratification.
5) Compute variance estimates using resampling methods, as they result in variance reduction as well as provide a way to anonymize the sampling design.
Overall, we found in the 2013 MOP that average cash holdings are about 84 dollars, which is about 10 dollars less than what was found in the 2013 CFM. Understanding the source of the difference in estimates is left for future work.
References
Arango, C., Huynh, K. P., and Sabetti, L. (2011). How Do You Pay? The Role of Incentives at the Point-of-Sale. Working Paper 11-23, Bank of Canada.
Arango, C. and Welte, A. (2012). The Bank of Canada's 2009 Method-of-Payments Survey: Methodology and Key Results. Discussion Paper 12-6, Bank of Canada.
Chen, H. and Shen, R. (2014). Variance Estimation for Survey-Weighted Data using
Resampling Methods: 2013 Method-of-Payment Survey Questionnaire. Technical
Report.
Deville, J. C., Sarndal, C. E., and Sautory, O. (1993). Generalized Raking Procedures in
Survey Sampling. Journal of the American Statistical Association 88, 1013-1020.
Dillman, D. (2009). Internet, Mail and Mixed-Mode Surveys: The Tailored Design Method, 3rd ed.
DiSogra, C., Cobb, C., Chan, E., and Dennis, M. J. (2011). Calibrating non-probability internet samples with probability samples using early adopter characteristics. Proceedings of the Section on Survey Research Methods, JSM.
Drasgow, F. (2004). Polychoric and Polyserial Correlation. John Wiley and Sons, Inc.
Fox, J. (2010). Polycor: Polychoric and Polyserial Correlations. R package version 0.7-8.
Henry, C., Huynh, K. P., and Shen, Q. R. (2014). 2013 Method-of-Payments Survey Report.
Kish, L. (1992). Weighting for unequal pi. Journal of Official Statistics 8, 183-200.
Kolenikov, S. (2010). Resampling variance estimation for complex survey data. The Stata Journal 10 (2), 165-199.
Lu, H. and Gelman, A. (2003). A method for estimating design-based sampling variances for surveys with weighting, poststratification, and raking. Journal of Official Statistics 19 (2), 133-151.
Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. John Wiley & Sons,
Ltd.
Rubin, D. (1976). Inference and missing data. Biometrika 63, 581-592.
Sarndal, C.-E. (2007). The calibration approach in survey theory and practice. Survey
Methodology 33, 99-119.
Sarndal, C.-E. and Lundstrom, S. (2008). Assessing auxiliary vectors for control of non-response bias in the calibration estimator. Journal of Official Statistics 24 (2), 167-191.
Schonlau, M., van Soest, A., and Kapteyn, A. (2007). Are 'webographic' or attitudinal questions useful for adjusting estimates from web surveys using propensity scoring? RAND Corporation.
Shao, J. (1996). Resampling methods in sample surveys (with discussion). Statistics 37, 203-254.
van Buuren, S. (2012). Flexible Imputation of Missing Data. Chapman and Hall/CRC
Press.
Vincent, K. (2014). 2013 Method-of-Payments Survey Calibration Manual. Technical Report.
Figure 1: Survey Weighting Workflow