ACCURACY & COMPLETENESS IN CONSUMER FILE DATA

Can Marketing Data Aid Survey Research?
Examining Accuracy and Completeness in Consumer File Data
Josh Pasek*
University of Michigan, 105 S. State Street, 5413 North Quad, Ann Arbor, MI, USA 48109.
Phone: +1-734-764-6717. Email: jpasek@umich.edu.
S. Mo Jang
University of South Carolina, 600 Assembly St., Carolina Coliseum RM4011, Columbia, SC,
USA 29201. Phone: +1-858-775-4978. Email: mo7788@gmail.com.
Curtiss L. Cobb III
GfK Custom Research, LLC and Facebook, 1 Hacker Way, Menlo Park, CA, USA 94025. Phone:
+1-559-284-0866. Email: ccobb@fb.com.
J. Michael Dennis
GfK Custom Research, LLC, 2100 Geng Road, Suite 210, Palo Alto, CA, USA 94303. Phone:
+1-650-288-1930. Email: jmdstat@yahoo.com.
Charles DiSogra
Abt SRBI, 275 Seventh Avenue, Suite 2700, New York, NY, USA 10001. Phone:
+1-617-386-4070. Email: C.DiSogra@srbi.com.
RUNNING HEADER:
ACCURACY & COMPLETENESS IN CONSUMER FILE DATA
6,751 words
3 Tables
2 Figures
JOSH PASEK is assistant professor of communication studies and faculty associate, Center for
Political Studies, Institute for Social Research, at the University of Michigan, Ann Arbor, MI,
USA. S. MO JANG is assistant professor in the School of Journalism and Mass
Communications, at the University of South Carolina, Columbia, SC, USA. CURTISS L. COBB
III was senior director of survey methodology at GfK Custom Research at the time research was
conducted and is currently research scientist at Facebook, Menlo Park, CA, USA. J. MICHAEL
DENNIS was managing director of government and academic research at GfK Custom Research,
Palo Alto, CA, USA. CHARLES DISOGRA is chief survey scientist at Abt SRBI, New York,
NY, USA. Data for the study were provided by GfK Custom Research, LLC. *Address
correspondence to Josh Pasek, University of Michigan, 105 S. State Street, 5413 North Quad,
Ann Arbor, MI, USA 48109; email: jpasek@umich.edu.
Can Marketing Data Aid Survey Research?
Examining Accuracy and Completeness in Consumer File Data
Abstract
Survey research depends crucially on its ability to collect data from a targeted sample and
for that sample to mirror the population of interest. Increasingly, survey firms are using data
purchased from marketing firms such as Experian and Acxiom (consumer file marketing data) as
a means to improve correspondence between survey respondents and the general public. These
data hold tremendous promise, not only for sampling at a reduced cost, but also for allowing
researchers to adjust biases that often occur across groups in traditional survey research. Though
these new techniques are gaining momentum and currency, there is to date no published research
comparing marketing data to more traditionally sampled data. The benefits from using marketing
data depend in part on whether the data are both accurate and complete. This paper is the first to
systematically assess the quality of one source of consumer file marketing data. Using a unique
dataset compiled by GfK KnowledgePanel®, we compare this source of ancillary marketing data
with self-report data on the same respondents to ask how frequently the two correspond. We also
evaluate conditions under which consumer file data are missing to determine whether patterns in
missing data might introduce systematic biases when data are analyzed. Results indicate that the
ancillary data differ from self-reported data on a variety of demographic factors. Further, data
were missing in patterns that could not be easily addressed. The findings urge caution for those
who hope to improve survey administration and design using currently available consumer file
data.
Survey research depends crucially on its ability to collect data from a targeted sample and
for that sample to mirror the population of interest. To achieve these goals, survey
methodologists are constantly comparing self-report survey data with data from other sources –
known as auxiliary data – as a way to both improve data collection and to adjust for survey
nonresponse (Deville, Sarndal, and Sautory 1993; Smith 2011). Since the 1950s, for example,
surveys have been compared to benchmarking studies, such as the Current Population Survey, to
determine which groups of people may be more or less responsive when sampled (cf. Kessler
and Little 1995). Researchers have also explored how paradata (information gathered in the
process of survey administration) and linked administrative data (from official sources) might
reveal patterns about whether and how people respond to particular kinds of surveys
(Calderwood and Lessof 2009; Couper and Lyberg 2005; Sakshaug and Kreuter 2012). As
auxiliary data sources, benchmarking, paradata, and administrative records can then be used to
adjust for differences in how people respond across groups and to improve administration and
targeting during the survey process (Kreuter 2013; Smith 2011). These sources are limited,
however, in that adjustments can only be made after households have been sampled for a study,
but not prior to the sampling process.1
1 In the case of benchmarks, this is true because adjustments need to be made on aggregate totals or
“marginals” rather than individual-level data. In the case of paradata, this is the case because information is
only generated in the process of sampling. Finally, in the case of administrative data, respondent consent is
often required for linkage (cf. Sakshaug and Kreuter 2012).
Recently, an additional type of auxiliary data has surfaced; researchers are purchasing
(usually through sample vendors) information from marketing companies such as Experian and
Acxiom – so-called “consumer file data” – which they are appending to survey samples. We refer
to this particular type of data as ancillary data. Unlike both paradata and benchmark data,
consumer file ancillary data can improve survey design even before any responses have been
collected. This is because the consumer file data can be purchased in advance of data collection.
Consumer file databases also include considerable information about American households
ranging from demographic characteristics to partisanship. As an ancillary data source, consumer
file marketing data could allow researchers to conduct targeted sampling, to adjust for
differences between respondents and the full set of sampled households (i.e. including
nonrespondents), and even to generalize from nonprobability samples to the public (Smith 2011).
These potential advantages, however, depend in part on our ability to trust the quality of
information provided in the marketing databases.
The current study represents a first test in evaluating one set of ancillary measures in
terms of both their accuracy and completeness as an early-stage inquiry into the potential of
consumer file data to enrich our survey toolkit for sampling and weighting. Using a unique
dataset that combines survey data sourced from an Address Based Sample (ABS) with consumer
file data from a well-regarded commercial source, we examine whether data derived from both
survey self-reports and an ancillary consumer file dataset lead to similar conclusions about the
households and individuals that respond. Our ability to use these data depends on whether
ancillary data values 1) correspond with survey responses (a proxy for accuracy), and 2) are not
missing information in ways that could undermine either the sample design applications or the
conclusions derived from the data (a proxy for completeness).
Understanding Consumer File Data
Smith (2011, 393) notes that, “many databases are ‘black boxes’ that do not disclose how
they are constructed and what rules are followed.” The provider Experian® (2013), for example,
reports that the data come “from more than 3,500 original public and proprietary sources” and
that the data provided “is tested and validated”. Ironically, they call this process “black box
analytics” (Tewksbury and Roy 2012). InfoUSA (2013) indicates the sources of some data,
stating that they come from places as diverse as “real estate and tax assessments,” and “voter
registration files,” but many important facets of the final data – such as specific sources for the
variables – are obscured on proprietary grounds. The firms do not provide information on how
many sources of data were aggregated, when the data were obtained, how discrepancies were
identified and prioritized, how data sources were linked to one another, what modes of data
collection were used, and the extent to which the data presented represent inferences rather than
observations. All this challenges researchers’ ability to evaluate accuracy and completeness. This
is a fast-evolving area where customer requests from the public opinion research community can
play a useful role in encouraging commercial enterprises to provide more transparency into the
sources and construction of their consumer file data.
Consumer File Data Quality
Studies assessing other forms of auxiliary data for survey administration have highlighted
the importance of both accuracy and completeness. These concerns have been raised most
notably in comparisons between self-reported questionnaires and official records. A number of
recent studies have identified discrepancies when linking survey results with both health records
(Davern et al. 2008; Fowles, Fowler, and Craft 1998; Hebert et al. 1999) and official voter
statistics (Berent, Lupia, and Krosnick 2011). Researchers initially presumed that such
discrepancies were a product of self-report errors. Emerging evidence, however, indicates that
incorrect official records are sometimes to blame (Hebert et al. 1999; Berent, Lupia, and
Krosnick 2011). Additionally, linking surveyed individuals with official records can introduce
sources of bias (Antoni 2011). Given that even official records can introduce errors, it should
come as no surprise that less carefully collected data might introduce problems, as has been
noted in some studies of paradata (West 2013; Sinibaldi, Durrant, and Kreuter 2013). These sorts
of errors can massively complicate the goal of using auxiliary sources to improve survey
sampling (cf. West and Little 2013), especially if, as Sinibaldi et al. (2013) found, there are
differences in accuracy between survey respondents and nonrespondents. Hence, although there
are huge differences between different types of auxiliary data, their uses commonly depend on
their ability to match survey results and on their consistent coverage of the population.
There are a number of reasons to worry that ancillary consumer file data may suffer from
particular limitations in accuracy, completeness, and also currency.
Inaccurate data might emerge 1) because information is out of date (e.g. the residents in a
household have changed or the data describe someone’s past situation, but not their current
status), 2) because marketing data were linked to the wrong individual or household (see Winkler
2006; Yancey 2010), 3) if individuals provided data to marketing companies that were untrue
(e.g. filling out a warranty form under an assumed name), 4) if the data were inferred from other
information, but happened to be inaccurate (e.g. presuming that anyone who buys diapers is a
parent or that anyone who lives in a highly educated neighborhood is highly educated), or 5) if
there is an error in reconciling conflicting data from two or more data sources. Mismatches
between ancillary consumer file data and self-report information emerged in the one earlier
examination of this type (DiSogra, Dennis, and Fahimi 2010), but results have not yet appeared
in the peer-reviewed literature. DiSogra et al. (2010) found evidence of inconsistent accuracy
across variables, indicating the potential for serious inferential problems when both sorts of data
are used together.2
2 A few additional conference papers have examined the use of marketing data for survey purposes, but have
not focused on the key issues of accuracy and completeness addressed here (e.g. Barron et al. 2012; Li et al.
2013; Srinath, Battaglia, and Khare 2004).
Incomplete consumer file data could result either from people who fail to provide the
kinds of information that marketing companies use (e.g. not filling out warranty forms,
registering to vote, deeding property, or borrowing on credit), or from an inability to confidently
determine which piece of consumer file data should be linked to a particular household or
individual. Although DiSogra et al. (2010) noted that a large proportion of households lacked
information on some ancillary variables, they did not explore whether missing data represented
systematic (rather than solely stochastic) error. Identifying the prevalence and nature of errors in
consumer file data thus remains a critical question for understanding their potential utility. If
errors in various sources of consumer file data are sufficiently frequent and systematic, the data
may not be useful for targeted sampling or nonresponse adjustment.
Implications of Data Quality
Low-quality ancillary data – whether because of misinformation or missing information –
present large challenges for incorporation into survey research. Consider, for example, the use
of various forms of auxiliary data to improve the sampling of a group like the U.S. Hispanic
population. Traditionally, Hispanic persons have been difficult to sample because they have a
lower response rate than non-Hispanic persons (cf. Johnson et al. 2002; Perl, Greely, and Gray
2006; Zambrana and Carter-Pokras 2001). Hence, it might seem efficient to use available data to
identify this population in advance and to increase the probability that Hispanic individuals
would be sampled. This could be accomplished by targeting individuals with common Hispanic
surnames (e.g. Davern et al. 2007; Hazuda et al. 1986; Word and Perkins 1996), areas with high
Hispanic population density (e.g. Fiscella and Fremont 2006), or using consumer file data where
Hispanic households have been “flagged” (cf. Barron et al. 2012; Li et al. 2013; Link and Burks
2013).
If done properly, oversampling procedures might ensure that the Hispanic segment
comprised the same proportion of respondents as in the target population (cf. Kalton 2009). But
bias could enter this process if some individuals were misclassified in any source of auxiliary
data, including ancillary data (cf. Swallen et al. 1997). If proper correctives are not employed,
non-Hispanic persons who were misclassified as Hispanic might be overrepresented in the final
dataset whereas Hispanic persons who were incorrectly classified as non-Hispanic would be
underrepresented. Collectively, this could lead us to mischaracterize the nature of both the
Hispanic segment of the sample and the larger population. In a similar vein, an inability to
characterize some individuals because of missing ancillary data could also result in inaccurate
conclusions (cf. King et al. 2001) or compel researchers to limit the share of the sample allocated
to strata dependent on ancillary data. Corrective strategies can be employed to prevent these
issues from introducing bias (Estevao and Sarndal 2006; West and Little 2013), but the benefits
of stratifying may or may not outweigh the costs (cf. Davern et al. 2007; Santos 1991; Winship
and Radbill 1994).3 Hence, the capacity for consumer file data to improve sampling similarly
depends on their accuracy and completeness.
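To make the misclassification concern concrete, the following minimal sketch computes how a flagged stratum would be composed when the flag is imperfect. The function name and all three input rates are hypothetical values chosen for illustration; none of them come from the study.

```python
def flagged_stratum_composition(prevalence, sensitivity, false_positive_rate):
    """Describe a stratum built from an imperfect ancillary flag.

    All inputs are hypothetical proportions between 0 and 1:
      prevalence          - true share of the target group in the frame
      sensitivity         - share of true group members who carry the flag
      false_positive_rate - share of non-members who carry the flag
    Returns (purity, missed): the share of flagged households that truly
    belong to the group, and the share of true members the flag misses.
    """
    true_flagged = prevalence * sensitivity
    false_flagged = (1 - prevalence) * false_positive_rate
    purity = true_flagged / (true_flagged + false_flagged)
    missed = 1 - sensitivity
    return purity, missed


# Illustrative (assumed) rates: 15% prevalence, 80% of members flagged,
# 5% of non-members flagged by mistake.
purity, missed = flagged_stratum_composition(0.15, 0.80, 0.05)
print(f"truly in group among flagged: {purity:.1%}; members missed: {missed:.1%}")
```

Under these assumed rates, roughly a quarter of the flagged stratum would not belong to the target group, and a fifth of true members would only be reachable through the unflagged stratum, which is why the correctives discussed above matter.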
Similar inferential limitations may confront researchers hoping to use consumer file data
to create post-stratification weights or address problematic sampling frames. Such correctives
depend on the extent to which ancillary data can discriminate between respondents and the
population as a whole (cf. Deville, Sarndal, and Sautory 1993). Although weighting adjustments
do not directly depend on the accuracy of ancillary data, most corrective tools require that the
processes generating any source of ancillary data are unrelated to distinctions between
respondents and nonrespondents (Ibrahim, Lipsitz, and Horton 2001). To the extent that some
consumer file data are inferred from other information (e.g. Greenyer 2006), this assumption is
likely to be violated.4 Missing information in consumer file data may also correlate with
distinctions between respondents and nonrespondents, which could hinder correctives as well.
3 The implications of using these kinds of data for sampling efficiency will depend on the accuracy of the
ancillary data, the relative sizes of strata, and the differences between individuals in targeted groups that
were correctly and incorrectly stratified.
To uncover the presence and potential implications of these sorts of errors, researchers
need to systematically evaluate consumer file data to understand the conditions under which they
might be useful and where they may introduce additional complications.
The Current Study
This research represents a first foray into evaluating the quality of one well-regarded
source of consumer file ancillary data for survey purposes. With the aid of a unique dataset from
GfK that links an address-based sample (ABS) with ancillary demographic data from a single
vendor, we conduct two analyses: Analysis 1 explores correspondence between ancillary data
and self-reports by comparing survey responses in the ABS sample to consumer file values about
those same households for the same variables. Analysis 2 evaluates the nature of incompleteness
in these ancillary data by investigating whether missingness in the ancillary data is ignorable or
nonignorable. The results allow us to test whether the ancillary data examined appear to provide
an accurate picture of respondents and thus whether the data might lead to improvements in
sampling and nonresponse adjustment.
4 The processes distinguishing between actual values and inferred values are likely to correlate with the
distinction between respondents and nonrespondents.
Methods
Sample
Data for all analyses come from GfK Custom Research, LLC. In January of 2011, GfK
used the U.S. Postal Service’s Computerized Delivery Sequence File (CDSF) to choose 25,000
random addresses that would be recruited by mail (with telephone follow-ups where numbers
were available) for the purpose of having them join KnowledgePanel®, an online probability-based
sample of U.S. adults. The CDSF covers over 95 percent of American households, making
it one of the broadest potential sampling frames (Iannacchione 2011).5 Because of the breadth of
the sampling frame, we could expect that, with a 100% response rate, the selected households would
closely mirror all American households. Addresses were chosen in four strata based on
age (household contained an 18- to 24-year-old person vs. all others and unknown) and Hispanic
status (an Hispanic person or surname is associated with the household vs. all others and
unknown) as predicted by the ancillary data; weights were used to correct for this decision (see
below).
Of 25,000 sampled addresses, 2,498 households were successfully recruited to join the
panel, a household response rate of 10.0% (AAPOR RR1). This response rate is in line with
many current probability sample surveys and thus represents a typical survey circumstance for
testing correspondence between data sources. Multiple individuals were allowed to sign up for
the panel from each of these households. In total, 4,472 individuals were recruited, with the
median household yielding two respondents.6
Because all panel surveys for KnowledgePanel® are completed online, GfK provided a
laptop computer, Internet access, or both to panel members for whom these devices/services
were not already available. Self-report data came from the Core Adult Profile Survey, the first
survey respondents complete upon admission to the panel. Thus, there is no gap between panel
admission and unit-level survey response. By using the first available data provided by these
panelists, we minimized the potential influence of attrition and panel conditioning.
The survey data used for the current study are intended to be illustrative of the quality of
samples used broadly by academic and commercial researchers. GfK’s KnowledgePanel has
been demonstrated to have error rates similar to RDD (Yeager et al. 2011); therefore, the
findings in the current study would appear to be projectable to RDD surveys of comparable quality.
5 The CDSF excludes only locations that do not receive individually addressed mail. These include institutionalized
populations (e.g. college dorms and prisons), some New York apartment buildings, and groups like the
homeless.
6 The mean number of respondents per household was 1.79. Fewer than 1% of all households yielded more
than 4 respondents and none yielded more than 10.
Consumer File Data
Consumer file data were linked to all 25,000 households in the ABS sample.7 These data
were provided and matched to addresses by Marketing Systems Group (MSG), the firm that
produced the ABS sampling frame. MSG collates data from several sources to compile
information on the CDSF addresses; MSG was responsible for matching consumer file data to
the CDSF addresses and to one another. All addresses in the sampling frame had consumer file
data for at least one variable. Ancillary data for the current study were originally sourced from
InfoUSA, Experian, and Acxiom. Since consumer file data were themselves produced through a
combination of aggregation and inferential techniques, it was impossible to trace the source of
any particular piece of information about a particular household.
Despite the opaque nature of individual ancillary measures, the firms providing the
information to MSG suggest that these data are ideal for tracking and identifying Americans and
are used as such by some researchers or in direct mail campaigns. Acxiom (2011, 10) claims its
data “covers more than 99% of marketable addresses worldwide” and incorporates regular
updates from the U.S. Postal Service. Experian notes that it excels at linking identities between
social media sites, phone numbers (landline and mobile), work and home addresses, email
accounts, and other online identifiers (Tewksbury and Roy 2012). And InfoUSA (2013) monitors
voter registration, utility, and real estate data to compile information, with monthly updates to
keep records current. These three firms represent some of the largest and most well-respected
sources of consumer file data. The aggregation across these sources as implemented by MSG
should also reduce the number of households for which we are missing data (though this process
could result in inconsistent data quality). Hence, we would expect that aggregating across these
sources would provide one of the best potential databases for keeping track of the American
public, presuming that the data themselves are accurate.8
7 Consumer file ancillary data are appended to GfK data, but not the other way around. GfK KnowledgePanel®
data are provided by respondents under an agreement where they cannot be linked to outside data sources in an
identifiable fashion. Data provided by respondents thus cannot directly influence the consumer file data
source, avoiding one possible confound.
Variables
Homeownership, household income, and household size were measured in both datasets
at the household level. Full question wordings and coding for household-level variables in the
ancillary data are shown in Online Appendix A.9 Marital status, education, and age were
measured in both datasets at the individual level. Question wordings and coding for these
variables are shown in Online Appendix B. Table 1 presents descriptive statistics for all measures
and their related missingness count among respondents (n=4,472) and for all sampled cases
(n=25,000).
[INSERT TABLE 1 ABOUT HERE]
Weighting
Six sets of weights were generated to assess correspondence between self-reported and
ancillary distributions across both analyses. For all analyses, we produced weights to correct for
GfK’s procedure in stratifying its recruitment sample. Additional weights were then designed to
adjust for differences between individual-level self-reports and household-level ancillary data.
Correctives were created that either 1) down-weighted data from households with multiple
members (for which two approaches were examined) or 2) that sampled only the individual in
each household whose age most closely corresponded with the age in the ancillary data (which
was examined with four additional methods). Because the substantive findings did not vary
across the six weighting strategies, which are described in full in Online Appendix C, we present
only results from those respondents in each household who most closely corresponded with the
age values from the ancillary data for all analyses.
8 Because of the nature of these data, we are not able to diagnose the source of discrepancies between
ancillary and self-reported data. We also cannot conclude that other sources of ancillary data or other sources
of survey data would not yield different results. Such comparisons should be a subject of continuing
research.
9 Substantive variables used at both the household level and the individual level were those for which
measurement categories could be matched across self-report and ancillary measures. Three ancillary
measures (presence of a telephone, race/ethnic status, and number of children in the household) were
excluded because the match between these measures in the datasets would have been inconsistent.
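The two families of correctives described above can be sketched roughly as follows. This is an illustrative reading of the text, not the procedure documented in Online Appendix C: the record layout, the function names, the tie-breaking rule, and the simple 1/k down-weight are all assumptions.

```python
from collections import defaultdict


def closest_age_respondents(respondents, ancillary_age):
    """Pick, per household, the respondent whose self-reported age is
    nearest the household's ancillary age.

    Assumption: ties keep the first respondent listed, and households
    without an ancillary age are skipped.
    """
    best = {}
    for r in respondents:
        hh = r["household"]
        if ancillary_age.get(hh) is None:
            continue  # no ancillary age to match against
        gap = abs(r["age"] - ancillary_age[hh])
        if hh not in best or gap < best[hh][0]:
            best[hh] = (gap, r)
    return {hh: r for hh, (_, r) in best.items()}


def household_downweights(respondents):
    """Give each respondent weight 1/k, where k is the number of
    respondents recruited from the same household (one possible
    down-weighting scheme, assumed here for illustration)."""
    counts = defaultdict(int)
    for r in respondents:
        counts[r["household"]] += 1
    return [1 / counts[r["household"]] for r in respondents]


# Hypothetical records, invented for this example.
respondents = [
    {"id": "a", "household": 1, "age": 34},
    {"id": "b", "household": 1, "age": 61},
    {"id": "c", "household": 2, "age": 45},
]
ancillary_age = {1: 58, 2: 47}

chosen = closest_age_respondents(respondents, ancillary_age)
weights = household_downweights(respondents)
print({hh: r["id"] for hh, r in chosen.items()}, weights)
```

Selecting one respondent per household keeps the comparison at the same level as the household-keyed ancillary record, whereas down-weighting retains every respondent but prevents large households from dominating the estimates.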
Analysis 1: Correspondence between Ancillary Data and Self-Reports
Analytic Method
To evaluate agreement (our proxy for accuracy) between self-reports and the ancillary
data examined herein, we compared the distributions of variables in the ancillary data with
corresponding measures in the self-reported survey results. This evaluation proceeded in a three-step
process. We first assessed the extent to which self-reported and ancillary data revealed
consistent information about households (and individuals). High agreement rates indicated
relatively accurate ancillary data whereas low agreement rates would call that accuracy into
question.10 Second, to understand whether the ancillary and self-report data differed in
systematic ways, we explored whether ancillary measures tended to provide systematically larger
or smaller values than corresponding self-reports. If larger or smaller values were
disproportional, the results would suggest that misclassifications were a product of bias rather
than random measurement error. Finally, for measures with more than two categories, we
assessed the proportion of responses that differed by a large margin – defined as greater than five
years for age and two or more categories for ordinal measures. These “far-off” cases were
unlikely to be a product of ancillary data that were simply out of date.
10 Although some of these discrepancies could occur as a product of inaccuracies in survey values, we think
this is not a major concern for two reasons. First, there is little evidence of systematic biases in reporting
demographic variables in surveys (Calahan 1968; Weaver 2000). Second, web-based survey administration
appears to minimize biases (see Chang and Krosnick 2010), further mitigating this likelihood.
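The three-step evaluation described in the Analytic Method can be sketched as a small function over paired values. This is a minimal, unweighted illustration; the function name, the integer coding of categories, and the example pairs are assumptions, and the study's actual estimates additionally applied the weights described earlier.

```python
def correspondence_summary(pairs, far_off_threshold):
    """Summarize (self_report, ancillary) pairs of ordinal codes or ages.

    Pairs with a missing value on either side (None) are dropped.
    far_off_threshold is the smallest absolute difference counted as
    "far-off": 2 for ordinal categories, 6 for age in whole years
    (i.e. greater than five years).
    """
    complete = [(s, a) for s, a in pairs if s is not None and a is not None]
    n = len(complete)
    if n == 0:
        raise ValueError("no complete pairs to compare")

    def pct(count):
        return 100 * count / n

    return {
        "n": n,
        # Step 1: exact agreement between the two sources
        "pct_agree": pct(sum(1 for s, a in complete if s == a)),
        # Step 2: direction of disagreement (lopsided shares suggest bias)
        "pct_ancillary_higher": pct(sum(1 for s, a in complete if a > s)),
        "pct_ancillary_lower": pct(sum(1 for s, a in complete if a < s)),
        # Step 3: large discrepancies
        "pct_far_off": pct(
            sum(1 for s, a in complete if abs(a - s) >= far_off_threshold)
        ),
    }


# Hypothetical ordinal codes (e.g. education categories), invented here.
example_pairs = [(3, 3), (2, 4), (5, 4), (1, 1), (2, None), (4, 1)]
summary = correspondence_summary(example_pairs, far_off_threshold=2)
print(summary)
```

Reporting the two directional shares separately is what allows random error (roughly balanced shares) to be distinguished from bias (one share much larger than the other).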
Results
Correspondence between ancillary and self-reported values differed markedly across
variables. The reports largely agreed in cases like homeownership, marital status, and age, with
89.0%, 73.1% and 67.0% agreement, respectively (Table 2). In contrast, household income,
household size, and education differed enormously between the data sources (with 22.1%,
27.8%, and 27.2% corresponding, respectively). There was no apparent pattern discriminating
between variables with high and low correspondence. For comparison, around 90-95% of
individuals report consistent values for changeable demographic variables from year to year
(Smith and Stephenson 1979).
Ancillary and self-reported measures corresponded at different rates across levels of the
same variables, a pattern indicative of bias. Looking only at households with both types of data,
homeownership, marital status, and income were consistently higher in the ancillary data relative
to self-reports. 23.4% of households self-reported that they did not own their homes, but fully
28.9% of these cases were classified as homeowners in the ancillary data. In contrast, only 5.5%
of self-reported homeowners were classified as non-owners in the ancillary data (see Table D1 in
Online Appendix D). Perhaps most troublingly, of the 38.8% of individuals who reported they
were unmarried in the self-reports, more than half (51.8%) were classified as married in the
ancillary data. Yet among individuals reporting that they were married, 89.9% were identically
classified in the ancillary data.
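As a quick arithmetic check, the homeownership figures reported above can be combined to recover the overall agreement rate. The sketch assumes that homeownership has only two categories and that all three percentages share the same base (households with both types of data); the variable names are illustrative.

```python
# Figures quoted in the text for homeownership.
share_self_non_owner = 0.234        # self-reported non-owners
non_owner_coded_as_owner = 0.289    # of those, coded as owners in ancillary data
owner_coded_as_non_owner = 0.055    # of self-reported owners, coded as non-owners

# Share of all households misclassified in each direction.
upward = share_self_non_owner * non_owner_coded_as_owner
downward = (1 - share_self_non_owner) * owner_coded_as_non_owner

agreement = 1 - (upward + downward)
print(f"upward {upward:.1%}, downward {downward:.1%}, agreement {agreement:.1%}")
```

Under these assumptions the implied agreement is about 89.0%, in line with the figure reported in Table 2, and the upward errors (about 6.8% of households) outnumber the downward errors (about 4.2%), which is the asymmetry the text describes as bias.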
Self-report and ancillary values did not always diverge in similar patterns. Individuals
tended to self-report lower incomes and educational achievement than was apparent in the
ancillary data whereas discrepancies in reports of household size and age did not skew as
strongly in a single direction. Overall, there did not seem to be a clear pattern for when the
values from the two data sources differed in these manners (Table 2; Online Appendix D).
[INSERT TABLE 2 ABOUT HERE]
Large discrepancies between data sources emerged frequently for most variables (see
Table 2, % Far-off). Some 43.1% of households reported income that differed from the category
suggested in the ancillary data by more than $10,000 per year. The number of occupants reported
by a household differed by two or more individuals from the ancillary value in 35.1% of cases.
One in four individuals, 24.7%, reported an education level that differed by two or more
categories from the ancillary value. And 19.9% of self-reported ages differed from the ancillary
value by six or more years, even though respondents were selected for the closest age match.
Such large discrepancies seem unlikely to have emerged from slightly outdated consumer file
data.
We also computed correlations between consumer file and self-reported measures of each
variable to test whether data for continuous measures may have differed in some systematic way
between the two sources. This could happen if ages were consistently out-of-date or if incomes
were overstated in the ancillary data. The correlations (r) varied considerably, but tended to be
moderate in strength (a single low of r=.19 for household size, and between .39
and .73 for all others). Hence, it seems unlikely that a single source of systematic error was
responsible for the discrepancies observed.
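The logic of this correlation check can be illustrated with a small sketch: if ancillary ages were simply out of date by a constant amount, exact agreement would be low but the correlation would remain near 1. The ages below are hypothetical, and the plain Pearson formula is used here without the survey weights applied in the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Unweighted Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical self-reported ages and ancillary ages that lag by 3 years.
self_ages = [25, 40, 58, 33, 71]
stale_ages = [age - 3 for age in self_ages]

exact_matches = sum(1 for s, a in zip(self_ages, stale_ages) if s == a)
r = pearson_r(self_ages, stale_ages)
print(exact_matches, round(r, 3))
```

A uniform shift of this kind leaves no exact matches yet a correlation of essentially 1, so moderate correlations like those reported above point away from a single uniform source of error.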
Discussion
Correspondence between the ancillary data and self-reports varied across the six variables
examined. Disparities between self-report and ancillary results were fairly large for income,
household size, and education; the data streams were generally more consistent when assessing
homeownership, though notable biases emerged. At a minimum, the findings indicate that the
ancillary data used may not be particularly accurate in their description of individual or
population parameters. However, ancillary information was considerably better than chance
determinations for all variables.
In considering uses of ancillary data, current results present a mixed picture. For
example, the data could either help or hinder researchers who wish to use the information to
better target demographic groups that are traditionally underrepresented in surveys.11 Typically
this process involves an attempt to sample individuals from such groups at disproportionately
high rates. Yet identifying targeted individuals has long proven a challenge, because we
ordinarily do not know anything about households or individuals before they are sampled. Prior
studies have stratified samples using tools such as lists of ethnic surnames (e.g. Davern et al.
2007; Fiscella and Fremont 2006) or heavily-minority Census tracts (ANES 2013; Kalton 2009)
to increase the proportion of respondents in these groups. Such strategies can increase the error
in a survey estimate even when proper weighting is applied (due to increases in variance).
Hispanic persons with traditionally Hispanic surnames might have different experiences from
those with names that are more difficult to classify. Similarly, African-American individuals who
live in predominantly African-American neighborhoods may have very different experiences
from those who live in predominantly White neighborhoods. Hence, this kind of targeted
sampling procedure can introduce bias unless proper weights are applied; researchers ignoring
this bias when making inferences could reach inaccurate conclusions concerning both targeted
groups and society as a whole.
11 The term “rare population” is often used to discuss targeted sampling of traditionally underrepresented
subpopulations; we avoid that term here in favor of targeted groups because it is also feasible to alter the
sampling ratio for large groups within the population (even a majority) based on the use of auxiliary data.
Counteracting the potential for bias when using targeted sampling can be complicated. It
may undermine the efficiency gained by using the ancillary (or other auxiliary) data in the
sampling process. Two sets of weights must be applied to prevent bias (Estevao and Sarndal
2006). First, one set must equalize the probability of selection between individuals in the
ancillary-defined target groups that were sampled at disproportionately high rates versus “all
others” in all other ancillary-defined groups. This step ensures that the easy-to-classify
individuals do not end up defining the category, but also eliminates much of the benefit from the
disproportionate sampling. Second, weights can be used to increase representation of the target
group (as defined by self-reports) to bring that group back to population proportion. When
agreement between ancillary and self-reported values is low, this can actually increase the
variance in the estimates and thus the expected error because misidentified individuals in the
target population will have even higher weights than they would have without targeted sampling
(cf. Santos 1991; Winship and Radbill 1994). The overall precision of a targeted sampling
strategy of this sort could either go up or down depending on agreement between the ancillary
and self-reported measures as well as the size of the sampling strata.
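The two sets of weights described above can be sketched as follows. This is a minimal illustration under assumed inputs, not the procedure used in the study: the sampling rates, the benchmark shares, the group labels, and the function name `two_stage_weights` are all hypothetical.

```python
def two_stage_weights(cases, selection_prob, population_share):
    """Return one weight per case.

    cases: list of (ancillary_group, self_report_group) tuples.
    selection_prob: sampling probability for each ancillary-defined stratum.
    population_share: benchmark share for each self-reported group.
    """
    # Stage 1: undo disproportionate sampling with inverse selection probabilities.
    base = [1.0 / selection_prob[anc] for anc, _ in cases]

    # Stage 2: post-stratify so self-reported groups match benchmark shares.
    total = sum(base)
    group_total = {}
    for (_, self_grp), w in zip(cases, base):
        group_total[self_grp] = group_total.get(self_grp, 0.0) + w
    return [
        w * population_share[self_grp] * total / group_total[self_grp]
        for (_, self_grp), w in zip(cases, base)
    ]

# Hypothetical sample: the ancillary "target" stratum was sampled at 4x the rate.
cases = (
    [("target", "target")] * 30      # correctly flagged target members
    + [("target", "other")] * 10     # flagged as target, self-report otherwise
    + [("other", "target")] * 5      # target members missed by the ancillary flag
    + [("other", "other")] * 55
)
weights = two_stage_weights(
    cases,
    selection_prob={"target": 0.04, "other": 0.01},
    population_share={"target": 0.15, "other": 0.85},
)
```

In this toy sample, the five target-group members missed by the ancillary flag end up with four times the weight of those who were correctly flagged. That imbalance is the mechanism by which low agreement between ancillary and self-reported values can inflate variance.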
Analysis 2: Missingness in Ancillary Data
Analytic Method
Three tests were used to assess the scope and nature of missingness in the ancillary data.
First, we examined the extent of missingness in the ancillary data for each variable and across
cases. Second, we compared missingness in ancillary data variables to self-reports for those same
measures. Differential missingness across self-report categories would provide strong evidence
of nonignorable missingness. Finally, we used logistic regressions to predict the presence of
missingness for each of the ancillary variables based on the values of self-report measures and an
OLS regression to predict the number of ancillary variables for which data were missing across
cases. Presumably, if ancillary data were Missing Completely at Random (MCAR), missing data
should not be concentrated among specific cases, should be unrelated to self-reports of those
same variables, and should be impossible to predict precisely with the logistic regressions. If
ancillary data were Missing At Random (MAR), in a way that could be predicted using
observable variables, we should see a strong ability for the logistic regressions to predict
missingness. Strong relations between self-reports and ancillary missingness on the same
variables, coupled with a relative inability to predict missingness in regressions, would indicate
that missingness was almost certainly nonignorable.
Missingness
Missingness indicator variables were created to identify cases missing ancillary data for
each of the variables of interest (0 = presence of data, 1 = absence of data). A total missingness
variable was defined as the sum of the six missingness indicators for each case (ranging from 0
to 6).
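The coding just described can be sketched in a few lines. The variable names and the example record are hypothetical, and `None` is assumed here to mark a value the vendor did not supply.

```python
VARIABLES = ["homeownership", "income", "household_size",
             "marital_status", "education", "age"]

def missingness_indicators(record):
    """Map each variable to 1 if its ancillary value is absent, else 0."""
    return {v: int(record.get(v) is None) for v in VARIABLES}

def total_missingness(record):
    """Number of ancillary variables missing for one case (0 to 6)."""
    return sum(missingness_indicators(record).values())

# Hypothetical ancillary record; None marks a value the vendor did not supply.
record = {"homeownership": "owner", "income": None, "household_size": 3,
          "marital_status": None, "education": "college", "age": 47}
flags = missingness_indicators(record)
n_missing = total_missingness(record)
```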
Descriptive Statistics
Ancillary data were missing for a large number of cases. On average, 16.5% of
households were missing data for any given ancillary data variable. Missingness varied
considerably across variables, from only 3.6% of households missing household size to 28.5% of
households missing age information (Figure 1, histogram a).
[INSERT FIGURE 1 ABOUT HERE]
We also explored how missing ancillary data varied across respondents. Although the
modal household was not missing data for any of the ancillary variables (44.7%; Figure 1,
histogram b), missingness was not concentrated in only a few households; only 10.3% were
missing data for more than two variables (Figure 1, b). Of households missing data, the vast
majority was missing information for only a single variable. To examine whether cases lacking
data on one variable were also likely to have missing data on other variables, we conducted a
reliability test among the six missingness indicator variables. Cronbach’s alpha was .59, indicating
moderate consistency among missingness indicators, but not enough to consider them a single
factor.
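For readers unfamiliar with the statistic, a minimal sketch of Cronbach's alpha applied to 0/1 missingness indicators follows. The indicator matrix is hypothetical, and the function uses the standard formula based on item variances and the variance of the summed score.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of cases, each a list of k item scores."""
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical 0/1 missingness indicators for six variables, eight households.
indicators = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
]
alpha = cronbach_alpha(indicators)
```

Values near 1 would indicate that households missing one variable tend to be missing the others as well; values near 0 would indicate that missingness on one variable says little about the rest.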
Ancillary Missingness by Self-Report Value
Missing ancillary data appeared distinctly nonrandom when compared with self-reported
household status on the same variables. Ancillary homeownership was missing for 4.9% of self-reported
home-owning households and 33.2% of non-owning households (Figure 2, a; χ²(1) =
273.8, p < .001). Ancillary income was missing systematically for households with lower
reported incomes: ancillary income data were missing for 13.5% of households reporting an income
below $35,000 per year but for only 3.3% of households with
incomes above $150,000 (Figure 2, b; χ²(7) = 56.7, p < .001). Smaller households were missing
more information about household size than were larger households (Figure 2, c; χ²(4) = 11.7,
p < .05). These results refuted the possibility that these ancillary data were MCAR and indicated
that missing ancillary household data were likely to be nonignorable.
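A comparison of this kind can be sketched as a Pearson chi-square test on a table of self-report category by ancillary missingness. The counts below are hypothetical and unweighted, so they will not reproduce the statistics reported above; the closed-form p-value is given only for the one-degree-of-freedom case.

```python
import math

def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-D count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical unweighted counts: rows are self-reported owners and non-owners,
# columns are ancillary homeownership present vs. missing.
table = [[950, 50],
         [400, 200]]
stat, df = pearson_chi_square(table)
p = p_value_df1(stat)
```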
[INSERT FIGURE 2 ABOUT HERE]
Rates of missingness in individual-level ancillary data also frequently depended on self-reports
for the same variables. When respondents reported that they were married, only 13.3% of
ancillary marital status data were missing. In contrast, 36.3% of ancillary marital information
was missing among unmarried individuals (Figure 2, d; χ²(1) = 118.4, p < .001). Variation in
missingness across self-reported education levels was not statistically significant (Figure 2, e;
χ²(4) = 3.9, p = .84). Finally, missing ancillary age information was more common among
younger individuals. For individuals aged 18-24, 62.6% of ancillary age data were missing; only
12.7% of ancillary age data were missing for 55-64 year olds (Figure 2, f; χ²(6) = 292.2, p
< .001). As with household-level variables, missing data for individual-level variables appeared
systematic.
Regressions
To understand patterns of missingness, regressions predicted the presence of missingness
for each of the ancillary data measures and the total number of missing ancillary variables as a
function of each respondent’s status on self-reported demographics. To conduct these
regressions, multiple imputations were used to account for missingness in self-reported values
among all respondents.12 All regression results were weighted as described above.
Predictors of missing ancillary data varied depending on the missingness indicator being
predicted. Three self-reported demographics predicted multiple missingness indicators:
homeownership, household size, and age. Compared to non-owners, self-reported owners were
less likely to lack ancillary data for homeownership, household income, marital status, and age
(Table 3, row 1). The self-reported number of persons in the household predicted missing
ancillary income, marital status, and age, with larger households translating into a reduced
likelihood of missingness. Age predicted missingness nonlinearly; middle-aged Americans were the least
likely group to be missing information for marital status or age. Self-reported marital status and
education each predicted one of the missingness indicators. Married individuals were, ceteris
paribus, less likely than unmarried individuals to be missing information on marital status.
Individuals who reported that they had less than a high school education (the omitted category)
were the most likely to be missing ancillary age information.
[INSERT TABLE 3 ABOUT HERE]
12 Multiple imputations were conducted using Multivariate Imputation by Chained Equations (MICE), predicting each
missing value with all self-reports and ancillary demographic variables (Buuren and Groothuis-Oudshoorn
2011). Running these same regressions without imputations led to the same conclusions. Imputed versions
were used to avoid the possibility that missingness in self-reports would bias the results. Hence, values for all
self-reported variables were imputed. Imputations were conducted to mirror the full set of respondents (not
all sampled households). Imputations were only used for regression models.
Despite significant predictors for most missingness indicators, missing ancillary data were
not well predicted in the current analyses. The McFadden’s pseudo R² for missingness indicators
was always below .20, suggesting that we could not effectively account for when data were
missing,13 and the full list of covariates improved the percent of cases correctly predicted over a
null model for only one of the six indicators: missing ancillary age information (Table 3).
Although the lack of strong prediction does not indicate that these covariates were irrelevant,
it does imply that we have an incomplete understanding of the circumstances under which
ancillary data were missing.
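The two fit summaries used in this paragraph can be sketched as follows. The outcomes and fitted probabilities are hypothetical stand-ins for the output of a logistic regression, the function names are our own, and the null model is taken to be an intercept-only model that predicts the overall rate of missingness for every case.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1 - pi)
               for yi, pi in zip(y, p))

def mcfadden_r2(y, p_fitted):
    """McFadden's pseudo R-squared: 1 - ln(L_fitted) / ln(L_null)."""
    base_rate = sum(y) / len(y)          # intercept-only model predicts the mean
    ll_null = log_likelihood(y, [base_rate] * len(y))
    return 1 - log_likelihood(y, p_fitted) / ll_null

def percent_correct(y, p, cutoff=0.5):
    """Share of cases classified correctly when predicting 1 above `cutoff`."""
    return sum((pi > cutoff) == (yi == 1) for yi, pi in zip(y, p)) / len(y)

# Hypothetical missingness outcomes and fitted probabilities from some model.
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
p_fitted = [0.10, 0.15, 0.20, 0.10, 0.25, 0.30, 0.45, 0.20, 0.40, 0.15]
r2 = mcfadden_r2(y, p_fitted)
fitted_acc = percent_correct(y, p_fitted)
null_acc = percent_correct(y, [sum(y) / len(y)] * len(y))
```

In this toy example the fitted model has a positive pseudo R² yet classifies no more cases correctly than the null model, because no fitted probability crosses the 0.5 cutoff. A model can therefore contain significant predictors without improving case-level prediction.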
Using an OLS model, predictions of the number of ancillary measures missing
revealed similar challenges. Missingness remained poorly explained even though evidence
suggested that overall missingness was related to less self-reported homeownership, larger
household sizes, unmarried status, and relative youth (Table 3, column 7). These variables again
captured only a small portion of the variance across individuals (R² = .12), though the model
may be limited due to the small number of individuals missing multiple ancillary measures.14
Discussion
Results of analysis 2 suggested that the ancillary data examined were missing in ways
that could be problematic for survey research; specifically, they appeared to represent a
nonignorable source of bias. Missingness was more common for some measures than for others
and varied across categories of self-reports for the same variables. This presents a series of
problems for researchers hoping to use ancillary data for sampling or to correct for known survey
errors as relevant data may be missing.
13 McFadden’s pseudo R² is an estimate of the proportion of variance accounted for in the tested model as
compared to a null model with only an intercept. The statistic represents the proportion of the total log
likelihood of the null model that is explained by the fitted model. It is calculated as 1 - (ln(L_fitted) / ln(L_null)). For
most purposes, it can be interpreted similarly to an R² statistic and reveals the approximate proportion of the
residual variance in a null model that is explained by the inclusion of all predictors.
14 Prediction may be slightly better than reported, however, given that the data are heavily skewed toward zero and do not
meet assumptions of normality implicit in OLS regression. Notably, however, no additional variance was
explained by treating missingness as a negative binomial, indicating that such gains are likely to be minimal.
Patterns of missingness in the ancillary data were difficult to predict with simple
covariates or regression techniques. Because missing data appeared to violate both Missing
Completely At Random (MCAR) and Missing At Random (MAR) assumptions, the use of these
ancillary data for analytic techniques could result in substantive error. For example, only 37% of
18-24 year old householders could be identified using the ancillary age data. Conclusions using only
this group might not mirror the 63% of 18-24 year olds whose households lacked ancillary age
data. Researchers hoping to use ancillary data such as those presented here for either
oversampling or corrections might therefore be well advised to conduct an analysis of how
sensitive their use of the data would be to violations of assumptions about the data’s accuracy and
completeness.
The most pernicious of our results indicated that missingness in these ancillary data was
often related to self-reported values of the same variables. This is a major problem because it
means that errors from the missing ancillary data may only be apparent once self-reports have
been collected. Of course, this undermines one of the biggest advantages in the use of these
consumer file data – namely that they could be used prior to the sampling process. Because
ancillary missingness correlated with self-reports, sampling strategies based on this set of
ancillary data would likely result in sampled units that are differentially accurate across different
variables (e.g. we do a much better job identifying homeowners across variables than we do at
identifying home renters). Rather than reducing variance in the weights required for such a sample,
oversampling with the use of ancillary variables necessitates a two-stage weighting
process (i.e. oversampled young people need to be down-weighted before actual young people
can be adjusted to match their population proportions).15 Depending on the specific variables and
sampling ratios in play, such corrections might sometimes increase the variance and design effect
of sample weights instead of reducing them (cf. Santos 1991; Winship and Radbill 1994).
15 Readers should note that this corrective will only work if all oversampled individuals are down-weighted
(regardless of whether they are actually in the target category or not) to correct for the stratification
procedure and if self-reports are used to weight the targeted group to match the population. Importantly, the
method for collecting population-level benchmark data on the targeted group must also match the method
used for generating this information about individuals (or households) in the survey for the final post-stratification to yield an unbiased set of weights.
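One common way to gauge how much unequal weights cost in precision is Kish's approximate design effect. The sketch below is offered only as an illustration: the approximation is not named in the text, and the weights are hypothetical.

```python
def kish_design_effect(weights):
    """Kish's approximate design effect from unequal weights: n * sum(w^2) / (sum w)^2."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2

def effective_sample_size(weights):
    """Nominal sample size divided by the weighting design effect."""
    return len(weights) / kish_design_effect(weights)

# Hypothetical final weights: most cases at 1, a few misclassified cases at 4.
equal = [1.0] * 100
unequal = [1.0] * 90 + [4.0] * 10

deff_equal = kish_design_effect(equal)
deff_unequal = kish_design_effect(unequal)
n_eff = effective_sample_size(unequal)
```

With these toy weights, ten heavily weighted cases out of one hundred are enough to reduce the effective sample size to roughly two thirds of the nominal size. A check like this, run on the weights a design actually requires, can show whether targeted sampling helped or hurt precision.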
General Discussion
Use of consumer file data for sample targeting and as a survey corrective is growing
(Smith 2011). We can reasonably expect the consumer file products to improve rapidly as a
result of the commercial sector’s investment in the data sciences. There is a productive role for
public opinion researchers to collaborate with the commercial sector in conducting specific tests
aimed at improving the accuracy and completeness of the consumer file data, and creating new
data products tailored to the sampling and weighting needs of the survey research field.
This is the first study to explore how one vendor source of these data derived from
multiple commercial sources compares with more traditional self-report measures. The current
analyses represent a first foray into understanding the nature of potential biases when using
ancillary marketing data to supplement (or supplant) the traditional survey process. In two
analyses we assessed accuracy and completeness in one source of consumer file marketing data.
The analyses presented provide some hopeful, but many discomforting signs for researchers
hoping to use at least the current source of ancillary consumer file data to bolster survey
research. We thus conclude that survey researchers should carefully consider the potential
implications of systematic bias and missingness in consumer file marketing data before
incorporating datasets into sampling and weighting procedures.
The consumer file data we examined were not consistently inaccurate. For some measures,
agreement rates between consumer file and self-reported information were very high. For other
measures, agreement rates were little better than chance. This pattern might emerge if some
variables or sources effectively reflect the population even while others do not. Continued
assessments will be needed to determine if certain classes of variables or sources of data provide
consistent agreement. Variables and sources where ancillary data correspond with self-reports
should provide the best chance for improving both survey administration and correction (West
and Little 2013).
Similar variability was observed in patterns of missing ancillary data in the current
analyses. Information about some variables, such as household size, was far more complete than
information about others, such as marital status. Further, patterns of missingness appeared
stochastic for some measures, such as education, whereas missingness in other measures
appeared to be highly systematic, such as age. This too presents a mixed picture. We should be
wary of the large amount of missing ancillary data in thinking about survey correctives, but
evidence of variability could imply that some ancillary measures may not pose systematic
problems. Identifying reliable measures would be of considerable value and should be the
subject of further investigation.
Sources of Inconsistency
Inconsistencies between consumer file data and self-reports that emerged in this study are
difficult to diagnose. They could have appeared for a variety of reasons. Perhaps these ancillary
data were out of date, perhaps they were products of inference on the part of data aggregators,
perhaps the tools used to link ancillary data with addresses were flawed, or perhaps this was a
function of the aggregation procedures used at the single firm examined. Generally consistent
results with increasingly strict weighting strategies (see Online Appendix F) suggest that
mismatches between individuals within a household were unlikely to account for all
discrepancies observed. It is also possible that the ancillary measures are capturing something
fundamentally different from survey responses (though what, exactly, would be unclear) or that
survey misreporting may account for some of the discrepancies. The roots of missingness are
similarly opaque. It is clear at a minimum that researchers cannot blindly trust that both survey
and ancillary data are highly accurate when they give such discrepant results. Whether other
sources of consumer file data will provide results that consistently match survey responses
remains an important topic for future research.
Of course, we wish that we could open the black box and critically evaluate the
procedures used for each step in the process of generating the consumer file data. Because these
data are of considerable value to the private companies that aggregate them, however, social
scientists seem unlikely to gain a full picture. Meanwhile, there remains considerable reward in
evaluating the ways that even these flawed data may yet aid survey research.
The fact that an outside firm was able to match one source of consumer file data to the
entire sample and the generally decent correspondence between the ancillary and self-reported
data examined suggest that such data could prove useful for some survey purposes. Specifically,
even data that are only somewhat accurate may be able to help researchers conduct targeted
sampling for underrepresented populations and adjust for survey nonresponse. We discuss some
of these possibilities in Online Appendix E.
Limitations and Future Research
This study presents results that pertain to a handful of demographic variables from a
single source of consumer file data. They indicate that these particular data may prove
problematic for a variety of research purposes. But there is much that remains unknown. We do
not know whether similar inaccuracies and omissions might complicate the use of other sources
of consumer file data. Correspondences might differ if data were matched to addresses in a
different way, derived from a different set of sources, or collected at different points in time.
Differences might also be observed if ancillary data were compared with survey data derived
using different sampling strategies (e.g., telephone sampling) or different recruitment
tools (e.g., without transitioning respondents to a web panel). Further, data on alternate types of
variables may vary in their correspondence with survey data and completeness.
We also cannot conclude that ancillary data are inappropriate for any particular use
without further examination. For most purposes, use of any type of auxiliary data – including the
ancillary consumer file data examined here – represents a tradeoff between the information that
can be gained through the use of a particular data source and the errors that emerge in the data
collection process. The value of the data for mitigating error depends on how those factors relate
to one another, not on the absolute accuracy or completeness of the sources.
Conclusions
In theory, consumer file marketing data would appear to offer a valuable resource for
improving survey design and implementation. In practice, inconsistencies between one source of
consumer file data and self-reports coupled with patterns of systematic missingness lead to
questions over how well we will be able to leverage these possibilities. The correspondence and
bias observed when comparing self-reported demographic data with the consumer file data
presented here suggest that researchers should proceed with caution. Awareness of the accuracy
and completeness of any source of ancillary information is an important prerequisite to its use.
Hence, instead of blindly assuming that these data will present an accurate portrait of the
American public, researchers should instead consider the potential improvements that the data
could offer as a set of open empirical questions. These queries, more than the ease and ability of
using consumer file data, should guide practitioners in their decision-making.
References
Acxiom. 2011. “Reaching More Consumers with Certainty.” Acxiom White Paper.
ANES. 2013. “User’s Guide and Codebook for the Preliminary Release of the ANES 2012 Time
Series Study.” Electionstudies.org. Ann Arbor, MI and Palo Alto, CA: University of
Michigan and Stanford University.
Antoni, Manfred. 2011. “Linking Survey Data with Administrative Employment Data: the Case
of the IAB-ALWA Survey.” http://doku.iab.de/fdz/events/2011/Antoni_presentation.pdf.
Barron, Martin, Michael Davern, Robert Montgomery, Xian Tao, Kirk Wolter, Wei Zeng,
Christina Dorell, and Carla Black. 2012. “Can Information From Market Research
Companies Be Used to Develop an Efficient Sampling Strategy for a Rare Population?” New
Orleans, LA: Hard to Reach Conference.
Berent, Matthew, Arthur Lupia, and Jon A. Krosnick. 2011. “The Quality of Government
Records and Over-Estimation of Registration and Turnout in Surveys: Lessons From the
2008 ANES Panel Study’s Registration and Turnout Validation Exercises.” nes012554.
American National Election Studies Working Papers.
http://www.electionstudies.org/resources/papers/nes012554.pdf.
Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “MICE: Multivariate Imputation by
Chained Equations in R.” Journal of Statistical Software 45 (3).
Calahan, Don. 1968. “Correlates of Respondent Accuracy in the Denver Validity Survey.”
Public Opinion Quarterly 32 (4): 607–21.
Calderwood, Lisa, and Carli Lessof. 2009. “Enhancing Longitudinal Surveys by Linking to
Administrative Data.” In Methodology of Longitudinal Surveys, edited by Peter Lynn, 55–72.
Chichester, UK: Wiley. doi:10.1002/9780470743874.ch4.
Chang, LinChiat, and Jon A. Krosnick. 2010. “Comparing Oral Interviewing with Self-Administered
Computerized Questionnaires: an Experiment.” Public Opinion Quarterly 74
(1): 154–67. doi:10.1093/poq/nfp090.
Couper, Mick P, and Lars Lyberg. 2005. “The Use of Paradata in Survey Research.”
Proceedings of the 55th Annual Meeting of the International Statistical Institute. Sydney,
Australia.
Davern, Michael, Donna McAlpine, Jeanette Ziegenfuss, and Timothy J. Beebe. 2007. “Are
Surname Telephone Oversamples an Efficient Way to Better Understand the Health and
Healthcare of Minority Group Members?” Medical Care 45 (11): 1098–1104.
Davern, Michael, Kathleen Thiede Call, Jeanette Ziegenfuss, Gestur Davidson, Timothy J. Beebe,
and Lynn Blewett. 2008. “Validating Health Insurance Coverage Survey Estimates: A
Comparison of Self-Reported Coverage and Administrative Data Records.” Public Opinion
Quarterly 72 (2). AAPOR: 241–59.
Deville, Jean-Claude, Carl-Erik Sarndal, and Olivier Sautory. 1993. “Generalized Raking
Procedures in Survey Sampling.” Journal of the American Statistical Association 88 (423).
American Statistical Association: 1013–20.
DiSogra, Charles, J. Michael Dennis, and Mansour Fahimi. 2010. “On the Quality of Ancillary
Data Available for Address-Based Sampling.” Proceedings of the Survey Research Methods
Section of the American Statistical Association. 4174–83.
Estevao, Victor M, and Carl-Erik Sarndal. 2006. “Survey Estimates by Calibration on Complex
Auxiliary Information.” International Statistical Review / Revue Internationale De
Statistique 74 (2). International Statistical Institute (ISI): 127–47.
Experian. 2013. “Data Quality.” Experian. Accessed May 28.
http://www.experian.com/dataselect/ds-data-quality.html.
Fiscella, Kevin, and Allen M. Fremont. 2006. “Use of Geocoding and Surname Analysis to
Estimate Race and Ethnicity.” Health Services Research 41 (4p1): 1482–1500.
doi:10.1111/j.1475-6773.2006.00551.x.
Fowles, Jinnet B., Elizabeth J. Fowler, and Cheryl Craft. 1998. “Validation of Claims Diagnoses
and Self-Reported Conditions Compared with Medical Records for Selected Chronic
Diseases.” The Journal of Ambulatory Care Management 21 (1). 24–34.
Greenyer, Andrew. 2006. “Back From the Grave: the Return of Modelled Consumer
Information.” International Journal of Retail & Distribution Management 34 (3): 212–18.
doi:10.1108/09590550610654375.
Hazuda, Helen P., Paul J. Comeaux, Michael P. Stern, Steven M. Haffner, Clayton W. Eifler, and
Marc Rosenthal. 1986. “A Comparison of Three Indicators for Identifying Mexican
Americans in Epidemiologic Research.” American Journal of Epidemiology 123 (1): 96–112.
Hebert, Paul L., Linda S. Geiss, Edward F. Tierney, Michael M. Engelgau, Barbara P. Yawn, and
A. Marshall McBean. 1999. “Identifying Persons with Diabetes Using Medicare Claims
Data.” American Journal of Medical Quality 14 (6): 270–77.
doi:10.1177/106286069901400607.
Iannacchione, Vincent G. 2011. “The Changing Role of Address-Based Sampling in Survey
Research.” Public Opinion Quarterly 75 (3): 556–75. doi:10.1093/poq/nfr017.
Ibrahim, Joseph G., Stuart R. Lipsitz, and Nick Horton. 2001. “Using Auxiliary Data for
Parameter Estimation with Non-Ignorably Missing Outcomes.” Journal of the Royal
Statistical Society. Series C (Applied Statistics) 50 (3): 361–73.
infoUSA. 2013. “Data Quality.” Infousa.com. Accessed May 28. http://www.infousa.com/dataquality/.
Johnson, Timothy P., Diane O'Rourke, Jane Burris, and Linda Owens. 2002. “Culture and
Survey Nonresponse.” In Survey Nonresponse, edited by Robert M. Groves, Don A. Dillman,
J. L. Eltinge, and Roderick J. A. Little, 55–69. New York: Wiley.
Kalton, Graham. 2009. “Methods for Oversampling Rare Subpopulations in Social Surveys.”
Survey Methodology 35 (2): 125–41.
Kessler, Ronald C., and Roderick J. A. Little. 1995. “Advances in Strategies for Minimizing and
Adjusting for Survey Nonresponse.” Epidemiologic Reviews 17 (1): 192–204.
King, Gary, James Honaker, Anne Joseph, and Kenneth Scheve. 2001. “Analyzing Incomplete
Political Science Data: An Alternative Algorithm for Multiple Imputation.” The American
Political Science Review 95 (1): 49–69.
Kreuter, Frauke. 2013. “Facing the Nonresponse Challenge.” The Annals of the American
Academy of Political and Social Science 645 (1): 23–35. doi:10.1177/0002716212456815.
Li, Ying, Whitney Murphy, Gillian Lawrence, Jennifer Vanicek, Kari Carris, and Felicia LeClere.
2013. “Hola or Hello? A Priori Assignment of Interview Language Using Demographic
Flags.” Annual Conference of the American Association for Public Opinion Research.
Boston, MA.
Link, Michael W., and Anh Thu Burks. 2013. “Leveraging Auxiliary Data, Differential
Incentives, and Survey Mode to Target Hard-to-Reach Groups in an Address-Based Sample
Design.” Public Opinion Quarterly 77 (3): 696–713. doi:10.1093/poq/nft018.
Perl, Paul, Jennifer Z. Greely, and Mark M. Gray. 2006. “What Proportion of Adult Hispanics
Are Catholic? A Review of Survey Data and Methodology.” Journal for the Scientific Study
of Religion 45 (3): 419–36.
Sakshaug, Joseph W., and Frauke Kreuter. 2012. “Assessing the Magnitude of Non-Consent
Biases in Linked Survey and Administrative Data.” Survey Research Methods 6 (2): 113–22.
Santos, Robert L. 1991. “One Approach to Oversampling Blacks and Hispanics: the National
Alcohol Survey.” Available from:
https://www.amstat.org/sections/SRMS/Proceedings/papers/1985_031.pdf.
Sinibaldi, Jennifer, Gabrielle B. Durrant, and Frauke Kreuter. 2013. “Evaluating the
Measurement Error of Interviewer Observed Paradata.” Public Opinion Quarterly 77 (S1):
173–93. doi:10.1093/poq/nfs062.
Smith, Tom W. 2011. “The Report of the International Workshop on Using Multi-Level Data
From Sample Frames, Auxiliary Databases, Paradata and Related Sources to Detect and
Adjust for Nonresponse Bias in Surveys.” International Journal of Public Opinion Research
23 (3): 389–402. doi:10.1093/ijpor/edr035.
Smith, Tom W., and C. Bruce Stephenson. 1979. “An Analysis of Test/Retest Experiments on
the 1972, 1973, 1974, and 1978 General Social Surveys.” 8. Publicdata.Norc.org. Chicago:
GSS Methodological Report.
Srinath, K. P., Michael P. Battaglia, and Meena Khare. 2004. “A Dual Frame Sampling Design
for an RDD Survey That Screens for a Rare Population.” Proceedings of the Survey
Research Methods Section of the American Statistical Association, 4424-29.
Swallen, Karen C., Dee W. West, Susan L. Stewart, Sally L. Glaser, and Pamela L. Horn-Ross.
1997. “Predictors of Misclassification of Hispanic Ethnicity in a Population-Based Cancer
Registry.” Annals of Epidemiology 7 (3): 200–206. doi:10.1016/S1047-2797(96)00154-8.
Tewksbury, Marcus, and Andy Roy. 2012. “The Experian Marketing Innovation Report 2012.”
Experian. http://www.experian.com/assets/marketing-services/reports/ems_2012_marketinginnovation_report.pdf.
Weaver, David A. 2000. “The Accuracy of Survey-Reported Marital Status: Evidence From
Survey Records Matched to Social Security Records.” Demography 37 (3): 395–99.
doi:10.2307/2648050.
West, Brady T. 2013. “An Examination of the Quality and Utility of Interviewer Observations in
the National Survey of Family Growth.” Journal of the Royal Statistical Society: Series A
(Statistics in Society) 176 (1): 211–25. doi:10.1111/j.1467-985X.2012.01038.x.
West, Brady T., and Roderick J. A. Little. 2013. “Non‐Response Adjustment of Survey
Estimates Based on Auxiliary Variables Subject to Error.” Journal of the Royal Statistical
Society. Series C (Applied Statistics) 62 (2): 213–31.
Winkler, William E. 2006. “Overview of Record Linkage and Current Research Directions.”
Statistics #2006-2. Statistical Research Division, U.S. Census Bureau, Research Report
Series.
Winship, Christopher, and Larry Radbill. 1994. “Sampling Weights and Regression Analysis.”
Sociological Methods & Research 23 (2): 230–57. doi:10.1177/0049124194023002004.
Word, David L, and R Colby Perkins Jr. 1996. “Building a Spanish Surname List for the 1990's:
a New Approach to an Old Problem.” Technical Working Paper No. 13. U. S. Bureau of the
Census. Washington, D.C.: U. S. Bureau of the Census.
Yancey, William E. 2010. “Expected Number of Random Duplications Within or Between Lists.”
Proceedings of the Section on Survey Research Methods, American Statistical Association.
2938-46.
Yeager, David S., Jon A. Krosnick, LinChiat Chang, Harold S. Javitz, Matthew S. Levendusky,
Alberto Simpser, and Rui Wang. 2011. “Comparing the Accuracy of RDD Telephone
Surveys and Internet Surveys Conducted with Probability and Non-Probability Samples.”
Public Opinion Quarterly 75 (4): 709–47. doi:10.1093/poq/nfr020.
Zambrana, Ruth E., and Olivia Carter-Pokras. 2001. “Health Data Issues for Hispanics:
Implications for Public Health Research.” Journal of Health Care for the Poor and
Underserved 12 (1): 20–34. doi:10.1353/hpu.2010.0547.
Table 1. Unweighted Descriptive Characteristics of Six Demographic Self-Report and Ancillary
Variables and the Number of Cases Missing Data for Each Variable
                            Self-Report     Ancillary Data        Ancillary Data
                            Data            (Respondents only)    (All sampled cases)
Homeownership
  Missing data (count)      1103            594                   4456
  Owner                     69.9%           79.4%                 76.9%
  Non-Owner                 30.1%           20.6%                 23.1%
Income
  Missing data (count)      1531            326                   2591
  < 15k                     12.7%           3.8%                  4.9%
  15k – 25k                 8.6%            6.6%                  8.3%
  25k – 35k                 9.1%            9.3%                  11.2%
  35k – 50k                 16.8%           15.6%                 16.8%
  50k – 75k                 21.3%           23.5%                 23.5%
  75k – 100k                13.3%           17.4%                 14.9%
  100k – 150k               12.9%           16.7%                 13.5%
  > 150k                    5.2%            7.1%                  6.9%
Household size
  Missing data (count)      8               167                   2068
  1 person                  14.9%           24.2%                 33.9%
  2 persons                 32.3%           26.9%                 27.3%
  3 persons                 20.7%           19.8%                 16.6%
  4 persons                 19.4%           13.3%                 10.2%
  5+ persons                12.7%           15.7%                 12.0%
Marital Status
  Missing data (count)      1811            986                   6903
  Married                   55.6%           77.1%                 71.1%
  Not Married               44.4%           22.9%                 28.9%
Education
  Missing data (count)      1811            910                   5392
  Less than HS              30.2%           16.7%                 24.6%
  High School               27.6%           25.9%                 23.0%
  Some College              30.6%           29.5%                 28.7%
  Bachelors degree          8.8%            17.6%                 14.9%
  Post Grad/Professional    2.8%            10.2%                 8.8%
Age
  Missing data (count)      8               1272                  8660
  Age – mean                43.3 years      51.0 years            50.5 years
                            (sd=17.4y)      (sd=13.7y)            (sd=16.2y)
Total N                     4472            4472                  25000
Table 2. Variable Value Comparisons between Survey and Ancillary Data (% Respondents)
                    % Survey      % Survey      % Survey      Total    % Far-off   N       Corr.
                    < Ancillary   = Ancillary   > Ancillary            cases               (r)
Homeownership       6.8%          89.0%         4.2%          100.0%   -           1620    .68
Household Income    51.1%         22.1%         26.8%         100.0%   43.1%       1524    .48
Household Size      39.2%         27.8%         33.0%         100.0%   35.1%       2404    .19
Marital Status      20.1%         73.1%         6.8%          100.0%   -           1241    .41
Education           53.2%         27.2%         19.6%         100.0%   24.7%       1275    .39
Age                 20.8%         67.0%         12.2%         100.0%   19.9%       1782    .73
Note: Far-off cases are counted when values from self-reported survey and ancillary data differ by more than one category (household income, household size,
and education) or more than five years (for ages). Far off cases were not computed for dichotomous variables. Ages within one year were considered equivalent.
N is the weighted overlap of non-missing cases between ancillary and self-report measures. All numbers are weighted by the best respondent match weight (see
Online Appendix F for alternatives)
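The classification rules in the note to Table 2 can be illustrated with a minimal Python sketch. The function name, its parameters, and the example values are hypothetical and are not drawn from the study's analysis code; for simplicity the sketch is unweighted, whereas the figures in Table 2 are weighted.

```python
def compare_values(pairs, far_off_threshold=1, equal_tolerance=0):
    """Summarize (survey, ancillary) pairs as percentages.

    far_off_threshold: differences larger than this count as far-off
        (1 category for ordinal variables, 5 years for age).
    equal_tolerance: differences up to this size count as equal
        (0 for categories, 1 year for age).
    Unweighted illustration only; the paper applies respondent weights.
    """
    counts = {"survey_lower": 0, "equal": 0, "survey_higher": 0, "far_off": 0}
    for survey, ancillary in pairs:
        diff = survey - ancillary
        if abs(diff) <= equal_tolerance:
            counts["equal"] += 1
        elif diff < 0:
            counts["survey_lower"] += 1
        else:
            counts["survey_higher"] += 1
        if abs(diff) > far_off_threshold:
            counts["far_off"] += 1
    n = len(pairs)
    return {key: 100.0 * value / n for key, value in counts.items()}
```

For an ordinal variable such as education the defaults apply; for age one would call `compare_values(pairs, far_off_threshold=5, equal_tolerance=1)`.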
Table 3. Logistic Regressions Predicting Missing Ancillary Data with Self-Reports
Missing
Homeownership
Homeowner
Missing Household
Income
-1.27
*** (.16)
Income - $15,000-24,999
Income - $25,000-34,999
Income - $35,000-49,999
Income - $50,000-74,999
Income - $75,000-99,999
Income - $100,000-149,999
Income - $150,000 or More
.10
.13
-.26
-.33
-.61
-.72
-.52
(.24)
(.25)
(.27)
(.25)
(.27)
(.37)
(.46)
.13
.53 +
.04
-.04
-.20
-.14
-.23
2 Persons in Household
3 Persons in Household
4 Persons in Household
5 or More Persons in Household
-.10
-.21
-.39
-.36
(.19)
(.21)
(.23)
(.25)
-.28
-.47 +
-.81 *
-1.70 ***
Married
-.25
(.17)
Education - High School Degree
Education - Some College
Education - College Degree
Education - Graduate Degree
.10
-.14
-.07
-.06
Age
Age2
-.03
.0001
Intercept
.48
McFadden's Pseudo R2/R2
-2 Log Likelihood
Percent Correctly Predicted (PCP)
Null Percent Correctly Predicted
N
*
+
+
Missing Marital
Status
.10
(.27)
-.50
*** (.12)
(.32)
(.30)
(.38)
(.36)
(.44)
(.48)
(.62)
-.64
-.01
.25
.16
-.12
.26
.36
(.70)
(.50)
(.47)
(.43)
(.52)
(.54)
(.67)
.03
.04
-.19
-.24
-.39
-.28
-.61
+
(.21)
(.20)
(.18)
(.18)
(.24)
(.29)
(.34)
(.24)
(.28)
(.33)
(.47)
-.18
.48
.22
-.77
(.35)
(.35)
(.41)
(.58)
-.18
-.32
-.46
-.86
(.14)
*
(.16)
*
(.19)
*** (.22)
-.22
(.25)
-.20
(.32)
-.49
(.19)
(.23)
(.29)
(.47)
.10
-.08
-.06
.22
(.26)
(.30)
(.43)
(.55)
.07
-.37
.17
.32
(.30)
(.38)
(.48)
(.72)
(.02)
(.0003)
-.01
-.0002
(.03)
(.0004)
-.05
.0005
(.04)
(.0004)
(.49)
-.86
(.69)
.17
1133.9
86.8%
86.9%
3199
-1.31 *** (.25)
Missing Household
Size
.17
585.1
94.1%
94.1%
3199
-2.31
** (.87)
.03
586.7
96.5%
96.5%
3199
Missing Education
Number of
Variables Missing
(.14)
-.99
*** (.12)
(.26)
(.28)
(.26)
(.22)
(.23)
(.25)
(.29)
.15
.15
-.04
-.27
-.42
-.36
-.10
(.22)
(.19)
(.23)
(.18)
(.23)
(.29)
(.27)
.04
.13
-.02
-.08
-.14 +
-.13
-.06
(.10)
(.08)
(.10)
(.07)
(.08)
(.10)
(.11)
.19
-.06
-.30
-.10
(.16)
(.19)
(.21)
(.22)
-.20
-.48
-.46
-.56
(.15)
(.17)
(.18)
(.20)
-.07
-.18 **
-.28 ***
-.38 ***
(.06)
(.07)
(.08)
(.08)
*** (.14)
.02
(.16)
.02
(.12)
-.10 +
(.05)
-.02
-.16
-.07
-.10
(.16)
(.15)
(.21)
(.40)
.02
-.11
.12
.08
(.15)
(.17)
(.22)
(.35)
-.42
-.32
-.09
-.27
(.13)
(.14)
(.18)
(.41)
-.05
-.12 *
-.002
-.02
(.06)
(.06)
(.10)
(.15)
-.03
.0004 *
(.02)
(.0002)
.01
.0000
(.02)
(.0002)
-.08 *** (.02)
.0004 *
(.0002)
-.03 *** (.01)
.0003*** (.0001)
2.75
2.47 *** (.17)
.15
(.39)
.06
1972.8
77.3%
77.4%
3199
-.04
Missing Age
.06
.33
.34
.34
.38 +
.21
.61 *
-1.94 *** (.46)
.01
1851.7
81.1%
81.1%
3199
+
**
*
**
**
*
*** (.40)
.13
1715.2
74.8%
71.5%
3199
-.53 *** (.06)
.12
3199
Note: Standard errors in parentheses. OLS was used to predict the number of missing variables for each individual (Column 7). Number of missing variables
ranged from 0 to 6. A negative binomial regression provided a poorer overall fit and was therefore not presented. All regressions were weighted using best
respondent match weights. Ns reflect total number of non-zero weighted cases because weights were set to a mean of one for this analysis. All numbers reflect
results after multiple imputation. + p<.10; *p < .05; **p < .01; ***p < .001 two-tailed.
Figure 1. Missing Ancillary Data by Variables and Respondents (Best Respondent Match
Weights)
[Figure not reproduced. Panel a) Missing Ancillary Data by Variable (N = 2498): Home Ownership,
Household Income, Household Size, Marital Status, Education, Age. Panel b) Distribution of Missing
Ancillary Data Across Respondents (N = 2498): number of variables missing, 0 to 6.]
Figure 2. Missing Ancillary Data by Self-Reported Variable Values (Best Respondent Match
Weights)
Online Appendix A. Question Wordings for Household-Level Variables
Homeownership (Nominal; 2 categories)
Self-report. Respondents were asked: “Are your living quarters... Owned or being bought
by you or someone in your household, rented for cash, or occupied without payment of cash
rent.” Respondents who selected “Owned or being bought by you or someone in your
household” were coded 1, all other answers were coded 0.
Ancillary. Ancillary homeownership data were categorized 1 for homeowners and 0 for
non-owners.
Recoding. Survey respondents reported their homeownership in three categories, “owned
or being bought by you or someone in your household”, “rented for cash”, “occupied without
payment of cash rent”. To facilitate comparisons with the ancillary data, the self-reported
responses were recoded into two categories, “homeowners” and “non-owners”.
Household Income (Ordinal)
Self-report. Respondents were asked: “Was your total HOUSEHOLD income in the past
12 months ...” Respondents could choose: “Below $35,000”, “$35,000 or more”, or “Don’t
Know”. Respondents who selected “Below $35,000” were asked: “We would like to get a better
estimate of your total HOUSEHOLD income in the past 12 months before taxes. Was it ...”
Respondents could choose: “Less than $5,000”, “$5,000 to $7,499”, “$7,500 to $9,999”,
“$10,000 to $12,499”, “$12,500 to $14,999”, “$15,000 to $19,999”, “$20,000 to $24,999”,
“$25,000 to $29,999”, or “ $30,000 to $34,999”. Respondents who selected “$35,000 or more”
were asked: “We would like to get a better estimate of your total HOUSEHOLD income in the
past 12 months before taxes. Was it ...” Respondents could choose: “$35,000 to $39,999”,
“$40,000 to $49,999”, “$50,000 to $59,999”, “$60,000 to $74,999”, “$75,000 to $84,999”,
“$85,000 to $99,999”, “$100,000 to $124,999”, “$125,000 to $149,000”, “$150,000 to
$174,999”, or “$175,000 or more”. Responses to all three questions were recoded into eight
categories: “Less than $14,999”, “$15,000 to $24,999”, “$25,000 to $34,999”, “$35,000 to
$49,999”, “$50,000 to $74,999”, “$75,000 to $99,999”, “$100,000 to $149,999”, and “More than
$150,000”.
Ancillary. Ancillary income was coded into categories for “$1,000-$14,999”,
“$15,000-$24,999”, “$25,000-$34,999”, “$35,000-$49,999”, “$50,000-$74,999”,
“$75,000-$99,999”, “$100,000-$124,999”, “$125,000-$149,999”, “$150,000-$174,999”,
“$175,000-$199,999”, “$200,000-$249,999”, and “$250,000+”. Ancillary income data were recoded into
eight categories: “Less than $14,999”, “$15,000 to $24,999”, “$25,000 to $34,999”, “$35,000 to
$49,999”, “$50,000 to $74,999”, “$75,000 to $99,999”, “$100,000 to $149,999”, and “More than
$150,000”.
Recoding. Responses to household income questions in both sources were recoded into
eight categories: “Less than $14,999”, “$15,000 to $24,999”, “$25,000 to $34,999”, “$35,000 to
$49,999”, “$50,000 to $74,999”, “$75,000 to $99,999”, “$100,000 to $149,999”, and “$150,000
or more”.
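The collapsing of both income measures into eight categories can be sketched as follows. This is an illustration only: it assumes each reported bracket is identified by its lower bound in dollars, and the names `CUTPOINTS`, `LABELS`, `recode_income`, and `income_label` are hypothetical rather than taken from the study.

```python
from bisect import bisect_right

# Lower bounds (in dollars) at which categories 2 through 8 begin.
CUTPOINTS = [15_000, 25_000, 35_000, 50_000, 75_000, 100_000, 150_000]

# Category labels as listed in the recoding description above.
LABELS = [
    "Less than $14,999", "$15,000 to $24,999", "$25,000 to $34,999",
    "$35,000 to $49,999", "$50,000 to $74,999", "$75,000 to $99,999",
    "$100,000 to $149,999", "$150,000 or more",
]


def recode_income(bracket_lower_bound):
    """Return the 1-8 category for a bracket identified by its lower bound."""
    return bisect_right(CUTPOINTS, bracket_lower_bound) + 1


def income_label(bracket_lower_bound):
    """Return the label of the collapsed category."""
    return LABELS[recode_income(bracket_lower_bound) - 1]
```

Under this assumption the self-report bracket “$60,000 to $74,999” (lower bound 60,000) and the ancillary bracket “$50,000-$74,999” (lower bound 50,000) both fall in category 5.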
Number of Persons in Household (Ordinal)
Self-report. Respondents were asked: “Including yourself, how many people currently
live in your household at least 50% of the time? Please remember to include babies or small
children, include unrelated individuals (such as roommates), and also include those now away
traveling, at school, or in a hospital.” Respondents could enter a number between 1 and 15.
Responses indicating more than 5 household members were collapsed into the single category:
“5 or more”.
Ancillary. Data were requested on the number of adults in each household, the presence
of children in the household, and the number of children in the household. Households that were
not listed as having children present were coded 0 for the number of children (N=19,732). The
number of adults and children in the household were summed to produce a variable for the total
number of persons in the household. Sums indicating more than 5 individuals in the household
were collapsed into the single category: “5 or more”.
Recoding. Number of persons in the household was coded to range from 1 to 5 in both
datasets, with values greater than 5 recoded to equal 5.
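A minimal sketch of the ancillary household-size construction described above follows. The function name and the use of `None` for a household with no children listed are assumptions made for illustration.

```python
def ancillary_household_size(num_adults, num_children=None):
    """Sum adults and children, top-coding the result at 5 ("5 or more").

    Households not listed as having children are treated as having 0
    children, as described in the text. Returns None when the number of
    adults itself is missing (an assumption of this sketch).
    """
    if num_adults is None:
        return None
    children = 0 if num_children is None else num_children
    return min(num_adults + children, 5)
```

The same top-coding applies to the self-report, for example `min(reported_size, 5)`.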
Number of Children in Household (Ordinal)
Self-report. Respondents were asked: “Including yourself, how many people currently
live in your household at least 50% of the time? Please remember to include babies or small
children, include unrelated individuals (such as roommates), and also include those now away
traveling, at school, or in a hospital.” Respondents could enter a number between 1 and 15.
Responses indicating more than 5 household members were collapsed into the single category:
“5 or more”.
Ancillary. Data were requested on the number of adults in each household, the presence
of children in the household, and the number of children in the household. Households that were
not listed as having children present were coded 0 for the number of children (N=19,732). The
number of adults and children in the household were summed to produce a variable for the total
number of persons in the household. Sums indicating more than 5 individuals in the household
were collapsed into the single category: “5 or more”.
Presence of Telephone (Nominal; 2 categories)
Self-report. Respondents were asked: “Is there at least one telephone INSIDE your home
that is currently working and is not a cell phone?” Respondents who selected “Yes” were coded
1, all other respondents were coded 0.
Ancillary. Phone number matches were requested for all households in the sample.
Phone numbers were matched for 11,881 households and could not be matched for 13,119
households. Matched households were coded 1, all other households were coded 0.
Online Appendix B. Question Wordings for Individual-Level Variables
Marital Status (Nominal; 2 categories)
Self-report. Respondents were asked: “Are you now married, widowed, divorced,
separated, never married, or living with a partner?” Response options were “married”,
“widowed”, “divorced”, “separated”, “never married”, and “living with partner”. Responses
were recoded 1 for “married” and 0 for all others.
Ancillary. Data were requested on the marital status of an individual in each household.
Ancillary marital status data were categorized 1 for married and 0 for single.
Recoding. Responses to marital status from both data sources were coded as 1 for
respondents who reported that they were currently married and 0 for all other respondents.
Marital status in the ancillary data was reported for individuals who were classified as “heads of
household”. This category (as all others in the ancillary data) was not defined.
Education
Self-report. Respondents were asked: “What is the highest level of school you have
completed?” Response options were “no formal education”, “first, second, third, or fourth
grade”, “fifth or sixth grade”, “seventh or eighth grade”, “ninth grade”, “tenth grade”, “eleventh
grade”, “twelfth grade no diploma”, “high school diploma or the equivalent”, “some college no
degree”, “associate degree”, “bachelor degree”, “master degree”, and “professional or doctoral
degree”.
Ancillary. Data were requested on the education level of an individual in each household.
Ancillary education was coded into six categories for “less than high school diploma”, “high
school diploma”, “some college”, “bachelor”, “graduate school”, and “Don’t know”.
Recoding. Education levels from both data sources were coded into 5 categories for “Less
than High School”, “High School Graduate”, “Some College”, “College Graduate” and “Post-Graduate” education levels.
Age
Self-report. Respondents could enter their age in an open-ended format.
Ancillary. Data were requested on an individual’s age in each household.
Recoding. Ages for all individuals were coded to range from 18 to 90 in analyses 1 and 2
and from 18 to 80 in analysis 3. To facilitate presentation, both data sources were also coded
into 7 categories for individuals aged 18-24, 25-34, 35-44, 45-54, 55-64, 65-74, and 75 and older.
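The age recoding can be sketched as below. Treating “coded to range from 18 to 90” as clipping values to those bounds is an assumption of this sketch, and all names are hypothetical.

```python
from bisect import bisect_right

# Lower bounds at which the second through seventh age categories begin.
AGE_CUTS = [25, 35, 45, 55, 65, 75]
AGE_LABELS = ["18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75 and older"]


def clip_age(age, low=18, high=90):
    """Restrict an age to the analysis range (assumed here to mean clipping)."""
    return max(low, min(high, age))


def age_category(age):
    """Return the presentation category for an age of 18 or older."""
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]
```

For analysis 3 the upper bound would be passed explicitly, as in `clip_age(age, high=80)`.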
Online Appendix C. Weighting
Six sets of weights were produced to match the various data sources with the American
public and to one another. These weights served two purposes. We adjusted for differences
between individual-level self-reports and household-level ancillary data and we corrected for the
stratified sampling procedure used by GfK to select households.
Household level weight
To produce data that could be compared across sources, we needed to match self-reports
to the values in ancillary data. The sampling procedure, however, allowed multiple individuals
from a single household to enter the panel. This introduced three potential problems. First, the
presence of multiple individuals from a single household introduced concerns about the
independence of observations. Second, the results of our analyses might be biased toward
households with multiple representatives. And third, it might be possible that ancillary data
could correctly match one individual in a household while providing an inaccurate portrait of
other individuals in the household. The first and second challenges are easily overcome by
weighting observations at the household, rather than individual, level. The third challenge is
more pernicious and requires that we consider the conditions under which household and
individual data should be considered in agreement. To circumvent these problems, we created
six sets of weights for respondents:
1. Pure household weight: (1) the weight was coded as the inverse of the number of
respondents in the household (Total unweighted N=4472; weighted N = 2498).
2. Household adult weight: (1) individuals under 18 were dropped; (2) the weight was
coded as the inverse of the number of adult respondents in the household. (Total unweighted N =
4134; weighted N = 2400).16
3. Best respondent match weight: (1) we collected the ages of all respondents in each
household; (2) the respondent closest to the age17 indicated by the household ancillary data was
selected and all other members of the household were dropped; (3) in households where no
respondents were close to the age indicated by the ancillary data or where multiple respondents
were equally close, individuals who were clearly not the best match were dropped and the weight
was coded as the inverse of the number of remaining individuals in the household (Total
unweighted N=3199; weighted N = 2498).
4. Best household match weight generous: (1) respondents were asked to provide the
names and ages of all individuals in the household; (2) in households where one individual was
closest to the age indicated by the household ancillary data, all other members of the household
were dropped (whether or not that individual was a respondent); (3) in households where
multiple individuals were equally close or ancillary age information was missing, the weight was
coded as the inverse of the number of remaining individuals in the household (Total unweighted
N=2010; weighted N = 1166).
5. Best household match weight strict: (1) respondents were asked to provide the names
and ages of all individuals in the household; (2) in households where one individual was closest
to the age indicated by the household ancillary data, all other members of the household were
dropped (whether or not that individual was a respondent); (3) in households where multiple
individuals were equally close, the weight was coded as the inverse of the number of remaining
individuals in the household; (4) all individuals in households with more than one individual for
which ancillary age information was missing were dropped (Total unweighted N=920; weighted
N = 909).
16 A total of 98 households had only respondents under age 18 and were thus dropped from the
dataset when these weights were used.
17 Age was used in these circumstances because it was the most commonly available piece of ancillary
information and was the only piece of ancillary information that could be consistently expected to
discriminate between members of a household. Other variables were either household-level or would be
expected to match multiple household members (e.g. marital status).
6. Sole household match weight: (1) respondents were asked to provide the names and
ages of all individuals in the household; (2) in households where one and only one individual
was within 5 years of the age indicated by the household ancillary data, all other members of the
household were dropped (whether or not that individual was a respondent); (3) all individuals in
households with more than one member (as defined by the delineation of household members)
where none met this criterion were also dropped.18 An equal weight of 1 was applied for all
remaining households (Total unweighted N=672; weighted N = 672).
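Two of the weights listed above can be sketched in Python. This is a simplified illustration, not the procedure's actual code: the function names and input layout are hypothetical, and the best-match version keeps every respondent when ancillary age is missing and treats only exact ties in age distance as "equally close", which the text does not specify.

```python
from collections import defaultdict


def pure_household_weights(respondents):
    """Weight 1: inverse of the number of respondents in each household.

    respondents: list of (respondent_id, household_id) tuples.
    """
    sizes = defaultdict(int)
    for _, household in respondents:
        sizes[household] += 1
    return {rid: 1.0 / sizes[household] for rid, household in respondents}


def best_respondent_match_weights(respondents, ancillary_age):
    """Weight 3 (simplified): keep the respondent(s) closest to the ancillary age.

    respondents: list of (respondent_id, household_id, age) tuples.
    ancillary_age: dict mapping household_id to the ancillary age (or None).
    Dropped respondents are absent from the returned dict.
    """
    by_household = defaultdict(list)
    for rid, household, age in respondents:
        by_household[household].append((rid, age))
    weights = {}
    for household, members in by_household.items():
        target = ancillary_age.get(household)
        if target is None:
            kept = members  # assumption: no ancillary age, keep everyone
        else:
            best = min(abs(age - target) for _, age in members)
            kept = [(rid, age) for rid, age in members if abs(age - target) == best]
        for rid, _ in kept:
            weights[rid] = 1.0 / len(kept)
    return weights
```

In both functions the weights within a household sum to one, so the weighted N equals the number of households represented.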
Probability of sampling correction
Probability of sampling corrections adjusted for deviation from an equal probability of
selection across strata. As part of a procedure to increase the number of respondents in
traditionally underrepresented groups, GfK used a stratified sampling technique. Households
were categorized into four groups depending on the age and Hispanic status indicated in the
ancillary data. Sampling probabilities were assigned to oversample households that included
Hispanics or individuals ages 18-24. Because our goal was to assess whether such techniques
might improve the survey process, we needed to eliminate any biases that might have been
introduced through this sampling procedure. To do so, we used two pieces of information: the
proportion of households out of a random sample of one million with each of the ancillary
demographic characteristics considered (population) and the proportion of respondents in each
demographic category according to the ancillary data (respondents). Respondent-level weights
were calculated to match the characteristics of respondents to those of the population (weight =
population proportion / respondent proportion).
18 This left only individuals who were the sole member of their household within 5 years of the ancillary data
age and individuals who were in households with only one member regardless of the match between
ancillary and self-reported age. Although this amounts to selecting on a dependent variable, it served as a
test of how sensitive the conclusions were to the member of the household chosen.
Best respondent match weights were multiplied by respondent-level weights for all
analyses presented. Alternate weighting strategies led to the same general conclusions and are
shown in Online Appendix F.
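The probability of sampling correction (weight = population proportion / respondent proportion) and its combination with a household weight can be sketched as follows. The sketch assumes a single grouping variable and unweighted respondent proportions; the group names, function names, and example proportions are hypothetical.

```python
from collections import Counter


def sampling_correction(respondent_groups, population_proportions):
    """Return population proportion / respondent proportion for each group.

    respondent_groups: one ancillary-data group label per respondent.
    population_proportions: dict mapping each group label to its share
        of the reference population.
    """
    counts = Counter(respondent_groups)
    n = len(respondent_groups)
    return {
        group: population_proportions[group] / (count / n)
        for group, count in counts.items()
    }


def final_weights(household_weights, respondent_groups, corrections):
    """Multiply each household-level weight by the group's correction."""
    return [w * corrections[g] for w, g in zip(household_weights, respondent_groups)]
```

Groups that were oversampled receive corrections below one and the remaining groups receive corrections above one.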
ACCURACY & COMPLETENESS IN CONSUMER FILE DATA
49
Online Appendix D. Comparison of Self-Report and Ancillary Values
Table D1 – Crosstabs Comparing Self-Report and Ancillary Values
Home Ownership
                  Ancillary
Self-Report       Non-Owner   Owner
Non-Owner         16.67*      6.78
Owner             4.18        72.36*
Difference        -1     0      1
Percent           6.78   89.04  4.18
Household Income
                  Ancillary
Self-Report       <$15k       $15k-25k    $25k-35k    $35k-50k    $50k-75k    $75k-100k   $100k-150k  >$150k
Less than $15k    1.55*       2.41        2.06        2.54        2.31        .72         .37         .26
$15k-25k          .48         .83*        1.73        1.47        2.71        1.00        .41         .17
$25k-35k          .65         .90         .29*        1.97        2.38        1.18        .54         .26
$35k-50k          .88         1.01        2.00        3.35*       4.80        2.50        2.37        .36
$50k-75k          .37         .39         1.46        2.61        6.61*       4.72        3.78        1.53
$75k-100k         .14         .11         .40         2.02        2.89        3.65*       4.06        .83
$100k-150k        .00         .28         .51         .81         2.66        3.05        3.92*       1.66
More than $150k   .09         .17         .00         .26         .16         .95         1.54        1.92*
Difference        -7     -6     -5     -4     -3     -2     -1     0      1      2      3      4      5      6      7
Percent           .26    .54    1.39   4.21   10.32  13.02  21.36  22.12  13.46  8.75   2.64   1.25   .42    .17    .09
Household Size
                  Ancillary
Self-Report       1           2           3           4           5 or more
1                 8.11*       6.81        4.61        2.49        2.46
2                 8.37        10.90*      7.04        3.28        4.71
3                 4.42        4.24        4.23*       2.83        2.00
4                 2.95        3.06        2.90        2.07*       2.99
5 or more         1.21        2.05        1.89        1.87        2.50*
Difference        -4     -3     -2     -1     0      1      2      3      4
Percent           2.46   7.20   9.89   19.67  27.82  17.39  9.37   5.00   1.21
Marital Status
                  Ancillary
Self-Report       Non-Married Married
Non-Married       18.72*      20.11
Married           6.80        54.38*
Difference        -1     0      1
Percent           20.11  73.09  6.80
Education
                  Ancillary
Self-Report       Less than HS  HS Grad     Some College  College Grad  Grad School
Less than HS      6.59*         8.87        6.44          2.24          .84
HS Grad           4.46          8.14*       11.64         3.54          .98
Some College      2.80          5.98        9.71*         11.01         4.59
College Grad      .88           1.50        1.88          2.06*         3.06
Grad School       .07           .13         .75           1.12          .73*
Difference        -4     -3     -2     -1     0      1      2      3      4
Percent           .84    3.22   14.56  34.58  27.23  13.44  5.05   1.01   .07
Matches are marked with an asterisk. Rows are self-report categories and columns are ancillary
categories; Difference is the self-report category minus the ancillary category. Numbers are
percentages of pairwise complete wtd. N (see Table 2). All values weighted using best respondent weights.
Figure D1 – Comparisons of Self-Reported and Ancillary Age
[Scatterplot not reproduced: Comparing Age in Ancillary Data with Self-Reported Measures.
Horizontal axis: Self-Reported Age (20 to 80). Vertical axis: Ancillary Data Age (20 to 80).
Wtd. N = 1782; Wtd. Cor = .73.]
Points are jittered to show density. Dashed line indicates 5-year margin. Weighted using best
respondent match weights.
Online Appendix E. Considerations for Using Consumer File Ancillary Data in Practice
Targeted data collection. The more accurate of the ancillary measures may be particularly
useful for oversampling individuals in hard-to-reach and low response rate groups. Researchers
adopting these strategies should carefully consider the bias-variance tradeoff of their choices,
however. An additional step in weighting can lead to increases in variance when oversampling
with the assistance of ancillary data (cf. Santos, 1991; Winship and Radbill, 1994). Also,
researchers employing sample designs with an over-sampled stratum defined by ancillary
information should always include an “all else” stratum to help mitigate bias and only establish
class eligibility based on the self-reported information and not the ancillary information.
Because ancillary data measures do not perfectly mirror self-reports, proper weighting of
an ancillary-generated oversample requires a multi-step process. First, researchers need to
stratify their sample into oversampled and non-oversampled groups based on the ancillary data.
Second, after sampling, data collected from the two samples need to be assigned a base weight
such that the ancillary variables match their pre-stratification proportion of the sampling frame.
This second step is necessary because some individuals in the target group (e.g. Hispanics) might
not be captured by the oversample (due to inaccuracies in the ancillary data). If these individuals
represent a systematic type of respondent, post-stratification without this adjustment would over-represent members of the target group who were correctly identified in the ancillary data and
would under-represent members of the target group who were incorrectly identified in the
ancillary data. Third, post-hoc correctives should be applied to the base weights to produce a
sample that matches known population parameters (Deville et al., 1993).
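The third step can be illustrated with simple raking (iterative proportional fitting), shown here as a stand-in for the broader family of calibration methods rather than as the specific procedure the authors have in mind. The sketch assumes base weights from the second step are already available; the function names, record layout, and targets are hypothetical.

```python
def rake(records, base_weights, targets, iterations=50):
    """Adjust base weights so weighted margins match population targets.

    records: list of dicts holding each respondent's self-reported categories.
    base_weights: one base weight per record (output of the second step).
    targets: dict mapping variable name to {category: population proportion}.
    Every category present in the records must appear in the targets.
    """
    weights = list(base_weights)
    for _ in range(iterations):
        for variable, target in targets.items():
            total = sum(weights)
            sums = {}
            for record, weight in zip(records, weights):
                category = record[variable]
                sums[category] = sums.get(category, 0.0) + weight
            # Scale each category so its weighted share equals the target.
            weights = [
                weight * target[record[variable]] * total / sums[record[variable]]
                for record, weight in zip(records, weights)
            ]
    return weights


def weighted_share(records, weights, variable, category):
    """Weighted proportion of records falling in a category."""
    total = sum(weights)
    matched = sum(w for r, w in zip(records, weights) if r[variable] == category)
    return matched / total
```

Because the targets are defined on self-reported variables, respondents in the target group who were missed by the ancillary-based oversample still contribute to the margins with their larger base weights.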
Nonresponse adjustment. Effective non-response adjustment using consumer file data
depends on the extent to which those data can effectively discriminate between the individuals
who do and do not respond to the survey. Technically, the usefulness of auxiliary data for this
purpose does not depend on the accuracy of the ancillary data. Instead, it depends only on
whether the auxiliary data can reliably distinguish between respondents and nonrespondents and
on whether the relations between auxiliary and self-report data among respondents are identical
to those among nonrespondents. Directly assessing this
question is a job for future research, but the results here should concern those hoping to make
such a correction. Inconsistent correspondence between consumer file data and self-reports is
indicative of a variety of biases that could undermine corrective tools. Results indicating non-ignorable missingness in particular imply that correctives may be differentially accurate
depending on the actual levels of particular variables.
Online Appendix F. Analyses Using All Weighting Strategies
Table F1. Descriptive Characteristics of Respondents for Six Demographic Variables and the
Number of Cases Missing Data
Table F1a – Unweighted, with Probability of Sampling Correction Only (Respondent N=4472)
                                Self-Report     Ancillary Data        Ancillary Data
                                Data            (Respondents only)    (All sampled cases)
Missing data – Homeownership    1103            594                   4456
Home Owner                      69.9%           79.4%                 76.9%
Non-Owner                       30.1%           20.6%                 23.1%
Missing data – Income           1531            326                   2591
Income < 15k                    12.7%           3.8%                  4.9%
Income 15k – 25k                8.6%            6.6%                  8.3%
Income 25k – 35k                9.1%            9.3%                  11.2%
Income 35k – 50k                16.8%           15.6%                 16.8%
Income 50k – 75k                21.3%           23.5%                 23.5%
Income 75k – 100k               13.3%           17.4%                 14.9%
Income 100k – 150k              12.9%           16.7%                 13.5%
Income > 150k                   5.2%            7.1%                  6.9%
Missing data – Household size   8               167                   2068
Household size – 1              14.9%           24.2%                 33.9%
Household size – 2              32.3%           26.9%                 27.3%
Household size – 3              20.7%           19.8%                 16.6%
Household size – 4              19.4%           13.3%                 10.2%
Household size > 4              12.7%           15.7%                 12.0%
Missing data – Marital status   1811            986                   6903
Married                         55.6%           77.1%                 71.1%
Not Married                     44.4%           22.9%                 28.9%
Missing data – Education        1811            910                   5392
Education – Less than HS        30.2%           16.7%                 24.6%
Education – HS                  27.6%           25.9%                 23.0%
Education – Some College        30.6%           29.5%                 28.7%
Education – Bachelors           8.8%            17.6%                 14.9%
Education – Post Grad           2.8%            10.2%                 8.8%
Missing data – Age              8               1272                  8660
Age – mean                      43.3 years      51.0 years            50.5 years
                                (sd=17.4y)      (sd=13.7y)            (sd=16.2y)
Table F1b – Pure Household Weight (N=2498)
                                Self-Report     Ancillary Data        Ancillary Data
                                Data            (Respondents only)    (All sampled cases)
Missing data – Homeownership    599             325                   4456
Home Owner                      69.6%           79.4%                 76.9%
Non-Owner                       30.4%           20.6%                 23.1%
Missing data – Income           850             187                   2591
Income < 15k                    13.1%           3.8%                  4.9%
Income 15k – 25k                9.3%            6.6%                  8.3%
Income 25k – 35k                9.1%            9.3%                  11.2%
Income 35k – 50k                17.2%           15.6%                 16.8%
Income 50k – 75k                20.7%           23.5%                 23.5%
Income 75k – 100k               13.5%           17.4%                 14.9%
Income 100k – 150k              12.2%           16.7%                 13.5%
Income > 150k                   4.9%            7.1%                  6.9%
Missing data – Household size   4               85                    2068
Household size – 1              24.6%           24.2%                 33.9%
Household size – 2              34.0%           26.9%                 27.3%
Household size – 3              18.1%           19.8%                 16.6%
Household size – 4              14.1%           13.3%                 10.2%
Household size > 4              9.3%            15.7%                 12.0%
Missing data – Marital status   884             585                   6903
Married                         52.0%           77.1%                 71.1%
Not Married                     48.0%           22.9%                 28.9%
Missing data – Education        830             510                   5392
Education – Less than HS        26.7%           16.7%                 24.6%
Education – HS                  27.9%           25.9%                 23.0%
Education – Some College        32.3%           29.5%                 28.7%
Education – Bachelors           10.2%           17.6%                 14.9%
Education – Post Grad           3.0%            10.2%                 8.8%
Missing data – Age              4               680                   8660
Age – mean                      46.8 years      51.9 years            51.5 years
                                (sd=15.9y)      (sd=14.1y)            (sd=16.2y)
Table F1c –Household Adult Weight (N=2400)
Self-Report Data
Percent
Home Owner
69.9%
Non-Owner
30.1%
Income < 15k
13.0%
Missing data
(count)
1103
Ancillary Data
(Respondents only)
Missing data
Percent
(count)
594
79.3%
20.7%
1531
4.3%
Ancillary Data
(All sampled cases)
Missing data
Percent
(count)
76.9%
23.1%
326
4.9%
Income 15k – 25k
9.2%
6.6%
8.3%
Income 25k – 35k
9.1%
9.1%
11.2%
Income 35k – 50k
17.3%
15.3%
16.8%
Income 50k – 75k
20.5%
24.9%
23.5%
Income 75k – 100k
13.8%
16.5%
14.9%
Income 100k – 150k
12.3%
16.2%
13.5%
4.8%
7.1%
6.9%
Income > 150k
Household size – 1
25.5%
Household size – 2
34.8%
26.7%
27.3%
Household size – 3
17.6%
20.6%
16.6%
Household size – 4
13.3%
12.3%
10.2%
Household size > 4
8.8%
14.7%
12.0%
8
25.8%
33.9%
53.6%
Not Married
46.4%
Education - Less than HS
24.5%
Education – HS
28.7%
25.3%
23.0%
Education – Some College
33.2%
29.4%
28.7%
Education - Bachelors
10.4%
18.7%
14.9%
Education – Post Grad
3.1%
10.3%
8.8%
Age – mean
43.3 years
(sd=17.4y)
74.9%
167
Married
1811
986
25.1%
1811
8
16.2%
51.0 years
(sd=13.7y)
4456
71.1%
2591
2068
6903
28.9%
910
1272
24.6%
50.5 years
(sd=16.2y)
5392
8660
Table F1d – Best Respondent Match Weight (N=2498)
Self-Report Data
Percent
Home Owner
69.6%
Non-Owner
30.4%
Income < 15k
13.1%
Missing data
(count)
626
Ancillary Data
(Respondents only)
Missing data
Percent
(count)
339
79.2%
20.8%
850
4.3%
Ancillary Data
(All sampled cases)
Missing data
Percent
(count)
76.9%
23.1%
194
4.9%
Income 15k – 25k
9.3%
6.6%
8.3%
Income 25k – 35k
9.1%
9.2%
11.2%
Income 35k – 50k
17.2%
15.4%
16.8%
Income 50k – 75k
20.7%
24.7%
23.5%
Income 75k – 100k
13.5%
16.6%
14.9%
Income 100k – 150k
12.2%
16.1%
13.5%
4.9%
7.1%
6.9%
Income > 150k
Household size – 1
24.6%
Household size – 2
34.0%
27.0%
27.3%
Household size – 3
18.1%
20.6%
16.6%
Household size – 4
14.1%
12.6%
10.2%
Household size > 4
9.3%
14.6%
12.0%
4
25.1%
33.9%
53.6%
Not Married
46.4%
Education - Less than HS
24.7%
Education – HS
28.8%
25.5%
23.0%
Education – Some College
33.5%
29.3%
28.7%
Education - Bachelors
10.1%
18.7%
14.9%
Education – Post Grad
2.9%
10.1%
8.8%
Age – mean
47.4 years
(sd=15.7y)
75.1%
90
Married
866
604
24.9%
866
4
16.3%
51.7 years
(sd=14.0y)
4456
71.1%
2591
2068
6903
28.9%
531
712
24.6%
50.5 years
(sd=16.2y)
5392
8660
Table F1e – Best Household Match Weight Generous (N=1166)
Self-Report Data
Percent
Missing data
(count)
314
Ancillary Data
(Respondents only)
Missing data
Percent
(count)
280
68.2%
Ancillary Data
(All sampled cases)
Missing data
Percent
(count)
Home Owner
56.4%
Non-Owner
43.6%
Income < 15k
17.5%
Income 15k – 25k
10.4%
9.4%
8.3%
Income 25k – 35k
12.3%
11.0%
11.2%
Income 35k – 50k
17.2%
16.2%
16.8%
Income 50k – 75k
17.3%
22.3%
23.5%
Income 75k – 100k
11.1%
14.5%
14.9%
Income 100k – 150k
9.8%
13.1%
13.5%
Income > 150k
4.4%
6.0%
6.9%
31.8%
423
7.4%
76.9%
23.1%
190
4.9%
Household size – 1
29.0%
Household size – 2
33.7%
29.7%
27.3%
Household size – 3
16.3%
20.6%
16.6%
Household size – 4
13.3%
10.5%
10.2%
Household size > 4
7.7%
11.4%
12.0%
2
27.8%
33.9%
44.6%
Not Married
55.4%
Education - Less than HS
27.9%
Education – HS
27.3%
23.6%
23.0%
Education – Some College
30.6%
29.6%
28.7%
Education - Bachelors
11.2%
16.9%
14.9%
Education – Post Grad
2.9%
7.5%
8.8%
Age – mean
56.4%
44.1 years
(sd=16.4y)
61.4%
48
Married
426
379
38.6%
426
2
22.5%
52.0 years
(sd=14.7y)
4456
71.1%
2591
2068
6903
28.9%
309
688
24.6%
50.5 years
(sd=16.2y)
5392
8660
Table F1f – Best Household Match Weight Strict (N=909)
Self-Report Data
Percent
Home Owner
71.1%
Non-Owner
28.9%
Income < 15k
13.4%
Missing data
(count)
267
Ancillary Data
(Respondents only)
Missing data
Percent
(count)
108
79.3%
20.7%
323
4.8%
Ancillary Data
(All sampled cases)
Missing data
Percent
(count)
76.9%
23.1%
64
4.9%
Income 15k – 25k
7.4%
5.9%
8.3%
Income 25k – 35k
9.8%
8.8%
11.2%
Income 35k – 50k
17.7%
15.5%
16.8%
Income 50k – 75k
20.2%
23.5%
23.5%
Income 75k – 100k
14.0%
17.6%
14.9%
Income 100k – 150k
12.4%
16.0%
13.5%
5.0%
7.9%
6.9%
Income > 150k
Household size – 1
37.4%
Household size – 2
31.0%
26.2%
27.3%
Household size – 3
12.0%
21.4%
16.6%
Household size – 4
12.6%
12.4%
10.2%
Household size > 4
7.0%
14.9%
12.0%
5
25.1%
33.9%
51.6%
Not Married
48.4%
Education - Less than HS
18.8%
Education – HS
31.2%
25.4%
23.0%
Education – Some College
34.3%
30.9%
28.7%
Education - Bachelors
12.9%
20.1%
14.9%
Education – Post Grad
2.8%
10.3%
8.8%
49.7 years
(sd=15.2y)
51.5 years
(sd=14.5y)
76.9%
50.5 years
(sd=16.2y)
Age – mean
74.8%
38
Married
374
189
25.2%
374
5
13.4%
4456
71.1%
2591
2068
6903
28.9%
184
198
24.6%
5392
8660
Table F1g – Sole Household Match Weight (N=672)
Self-Report Data
Percent
Home Owner
64.1%
Non-Owner
35.9%
Income < 15k
17.0%
Income 15k – 25k
Missing data
(count)
267
Ancillary Data
(Respondents only)
Missing data
Percent
(count)
100
76.2%
23.8%
292
9.3%
5.6%
Ancillary Data
(All sampled cases)
Missing data
Percent
(count)
76.9%
23.1%
64
7.3%
4.9%
11.4%
11.0%
11.2%
Income 35k – 50k
20.2%
16.8%
16.8%
Income 50k – 75k
17.8%
24.4%
23.5%
Income 75k – 100k
10.6%
14.6%
14.9%
Income 100k – 150k
10.1%
13.5%
13.5%
3.6%
6.9%
6.9%
Household size – 1
49.6%
Household size – 2
27.3%
26.8%
27.3%
Household size – 3
10.7%
21.5%
16.6%
Household size – 4
8.7%
11.6%
10.2%
Household size > 4
3.8%
12.4%
12.0%
5
27.7%
33.9%
27.1%
Not Married
72.9%
Education - Less than HS
18.8%
Education – HS
31.1%
25.4%
23.0%
Education – Some College
31.9%
30.1%
28.7%
Education - Bachelors
14.7%
19.7%
14.9%
Education – Post Grad
3.5%
10.7%
8.8%
Age – mean
50.3 years
(sd=15.5y)
68.5%
30
Married
376
175
31.5%
376
5
14.2%
52.9 years
(sd=14.5y)
2591
8.3%
Income 25k – 35k
Income > 150k
4456
71.1%
2068
6903
28.9%
138
196
24.6%
50.5 years
(sd=16.2y)
5392
8660
Table 2. Variable Value Comparisons between Survey and Ancillary Data (% of Respondents)
Table F2a – Unweighted, with Probability of Sampling Correction Only

| Variable | % Survey < Ancillary | % Survey = Ancillary | % Survey > Ancillary | Total | % Far-off cases | n | Corr. (r) |
|---|---|---|---|---|---|---|---|
| Homeownership | 6.5% | 89.5% | 4.0% | 100.0% | -- | 2925 | .70 |
| Household Income | 51.2% | 22.0% | 26.9% | 100.0% | 42.9% | 2734 | .51 |
| Household Size | 33.6% | 25.6% | 40.8% | 100.0% | 36.6% | 4297 | .18 |
| Marital Status | 23.3% | 69.4% | 7.2% | 100.0% | -- | 2107 | .32 |
| Education | 55.4% | 26.4% | 18.2% | 100.0% | 27.1% | 2104 | .36 |
| Age* | 38.9% | 46.0% | 15.1% | 100.0% | 37.5% | 3192 | .53 |

Table F2b – Pure Household Weight

| Variable | % Survey < Ancillary | % Survey = Ancillary | % Survey > Ancillary | Total | % Far-off cases | n | Corr. (r) |
|---|---|---|---|---|---|---|---|
| Homeownership | 6.8% | 89.0% | 4.2% | 100.0% | -- | 1621 | .68 |
| Household Income | 51.1% | 22.1% | 26.8% | 100.0% | 43.1% | 1524 | .48 |
| Household Size | 39.1% | 27.3% | 33.6% | 100.0% | 35.4% | 2406 | .18 |
| Marital Status | 22.7% | 70.7% | 6.6% | 100.0% | -- | 1230 | .38 |
| Education | 53.4% | 26.7% | 19.9% | 100.0% | 26.7% | 1261 | .35 |
| Age* | 33.9% | 50.4% | 15.7% | 100.0% | 32.5% | 1782 | .58 |

Table F2c – Household Adult Weight

| Variable | % Survey < Ancillary | % Survey = Ancillary | % Survey > Ancillary | Total | % Far-off cases | n | Corr. (r) |
|---|---|---|---|---|---|---|---|
| Homeownership | 6.7% | 89.0% | 4.3% | 100.0% | -- | 1559 | .68 |
| Household Income | 50.8% | 22.3% | 26.9% | 100.0% | 42.9% | 1465 | .48 |
| Household Size | 39.8% | 27.5% | 32.8% | 100.0% | 35.1% | 2311 | .18 |
| Marital Status | 20.9% | 72.3% | 6.8% | 100.0% | -- | 1193 | .40 |
| Education | 52.5% | 27.0% | 20.4% | 100.0% | 26.1% | 1229 | .36 |
| Age* | 31.4% | 52.4% | 16.3% | 100.0% | 29.9% | 1716 | .60 |
Table F2d – Best Respondent Match Weight

| Variable | % Survey < Ancillary | % Survey = Ancillary | % Survey > Ancillary | Total | % Far-off cases | n | Corr. (r) |
|---|---|---|---|---|---|---|---|
| Homeownership | 6.8% | 89.0% | 4.2% | 100.0% | -- | 1620 | .68 |
| Household Income | 51.1% | 22.1% | 26.8% | 100.0% | 43.1% | 1524 | .48 |
| Household Size | 39.2% | 27.8% | 33.0% | 100.0% | 35.1% | 2404 | .19 |
| Marital Status | 20.1% | 73.1% | 6.8% | 100.0% | -- | 1241 | .41 |
| Education | 53.2% | 27.2% | 19.6% | 100.0% | 24.7% | 1275 | .39 |
| Age* | 20.8% | 67.0% | 12.2% | 100.0% | 19.9% | 1782 | .73 |
Table F2e – Best Household Match Weight Generous

| Variable | % Survey < Ancillary | % Survey = Ancillary | % Survey > Ancillary | Total | % Far-off cases | n | Corr. (r) |
|---|---|---|---|---|---|---|---|
| Homeownership | 7.3% | 86.2% | 6.5% | 100.0% | -- | 651 | .69 |
| Household Income | 48.5% | 22.6% | 28.9% | 100.0% | 43.4% | 621 | .53 |
| Household Size | 38.5% | 27.2% | 34.2% | 100.0% | 35.4% | 1115 | .14 |
| Marital Status | 19.8% | 69.4% | 10.8% | 100.0% | -- | 499 | .39 |
| Education | 46.4% | 30.3% | 23.4% | 100.0% | 23.9% | 540 | .39 |
| Age* | 8.1% | 86.4% | 5.5% | 100.0% | 9.4% | 476 | .83 |
Table F2f – Best Household Match Weight Strict

| Variable | % Survey < Ancillary | % Survey = Ancillary | % Survey > Ancillary | Total | % Far-off cases | n | Corr. (r) |
|---|---|---|---|---|---|---|---|
| Homeownership | 4.7% | 92.4% | 2.9% | 100.0% | -- | 573 | .78 |
| Household Income | 47.9% | 22.5% | 29.6% | 100.0% | 42.7% | 549 | .49 |
| Household Size | 46.4% | 30.3% | 23.2% | 100.0% | 31.1% | 866 | .31 |
| Marital Status | 19.1% | 75.6% | 5.3% | 100.0% | -- | 423 | .49 |
| Education | 51.2% | 28.6% | 20.1% | 100.0% | 22.5% | 422 | .39 |
| Age* | 7.4% | 87.5% | 5.1% | 100.0% | 8.5% | 707 | .86 |
Table F2g – Sole Household Match Weight

| Variable | % Survey < Ancillary | % Survey = Ancillary | % Survey > Ancillary | Total | % Far-off cases | n | Corr. (r) |
|---|---|---|---|---|---|---|---|
| Homeownership | 5.6% | 90.0% | 4.4% | 100.0% | -- | 342 | .74 |
| Household Income | 48.2% | 21.4% | 30.4% | 100.0% | 46.7% | 344 | .45 |
| Household Size | 52.0% | 29.7% | 18.3% | 100.0% | 33.7% | 637 | .23 |
| Marital Status | 26.6% | 68.5% | 4.9% | 100.0% | -- | 198 | .43 |
| Education | 48.4% | 31.8% | 19.8% | 100.0% | 21.8% | 231 | .42 |
| Age* | 4.6% | 92.7% | 2.7% | 100.0% | 3.3% | 471 | .94 |
Note: Far-off cases are those in which the self-reported survey value and the ancillary value differ by more than one
category (household income, household size, and education) or by more than five years (age). Far-off cases were
not computed for dichotomous variables. Ages within one year of each other were considered equivalent. n is the weighted
overlap of non-missing cases between the ancillary and self-report measures.
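The comparison rules in the note can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `compare_values`, the `Comparison` container, and the parameters `equal_tol` and `far_off_gap` are hypothetical names chosen for this illustration, and ordinal variables are assumed to be coded as consecutive integers. Cases missing on either measure are skipped, so the returned `n` is the weighted overlap of non-missing cases.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Comparison:
    pct_lower: float    # % of cases where survey < ancillary
    pct_equal: float    # % of cases where survey = ancillary
    pct_higher: float   # % of cases where survey > ancillary
    pct_far_off: float  # % of cases differing by more than far_off_gap
    n: float            # weighted count of overlapping non-missing cases


def compare_values(
    survey: Sequence[Optional[float]],
    ancillary: Sequence[Optional[float]],
    weights: Optional[Sequence[float]] = None,
    equal_tol: float = 0,    # assumed: 0 for categories, 1 for age in years
    far_off_gap: float = 1,  # assumed: 1 category, or 5 years for age
) -> Comparison:
    """Compare self-reported and ancillary values case by case (illustrative)."""
    if weights is None:
        weights = [1.0] * len(survey)
    lower = equal = higher = far_off = total = 0.0
    for s, a, w in zip(survey, ancillary, weights):
        if s is None or a is None:
            continue  # keep only cases observed in both sources
        total += w
        diff = s - a
        if abs(diff) <= equal_tol:
            equal += w
        elif diff < 0:
            lower += w
        else:
            higher += w
        if abs(diff) > far_off_gap:
            far_off += w
    if total == 0:
        raise ValueError("no overlapping non-missing cases")
    return Comparison(
        pct_lower=100 * lower / total,
        pct_equal=100 * equal / total,
        pct_higher=100 * higher / total,
        pct_far_off=100 * far_off / total,
        n=total,
    )


# Ordinal categories: exact match required, far-off means more than one category apart.
categories = compare_values([1, 2, 3, 5, None], [1, 3, 1, 5, 2])
# Ages: within one year counts as equal, far-off means more than five years apart.
ages = compare_values([30, 41, 50], [31, 35, 60], equal_tol=1, far_off_gap=5)
```

For dichotomous variables such as homeownership and marital status, the far-off percentage would simply be ignored, matching the "--" entries in the tables.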
Table 3. Logistic Regressions Predicting Missing Ancillary Data with Self-Reports
Table F3a – Unweighted, with Probability of Sampling Correction Only
Missing
Home
Ownership
-1.37***
(.11)
Missing
Household
Income
-1.24***
(.17)
Missing
Household
Size
-.10
(.26)
Missing
Marital
Status
-.54***
(.09)
-.06
(.17)
.03
(.17)
-.35*
(.16)
-.44+
(.23)
-.49*
(.22)
-.78**
(.24)
-.53
(.38)
.05
(.28)
.55*
(.25)
.01
(.26)
-.23
(.33)
-.13
(.30)
-.24
(.35)
-.10
(.45)
-.44
(.50)
.36
(.37)
.28
(.36)
.40
(.36)
.07
(.40)
.39
(.44)
.51
(.57)
-.13
(.16)
-.24
(.17)
-.34+
(.18)
-.17
(.19)
-.36+
(.21)
-.59*
(.23)
-.69**
(.24)
-1.65***
(.32)
Married
-.19
(.12)
Education - High School Degree
Self-Report Variable
Home Owner
Income - $15,000-24,999
Income - $25,000-34,999
Income - $35,000-49,999
Income - $50,000-74,999
Income - $75,000-99,999
Income - $100,000-149,999
Income - $150,000 or more
2 Persons in Household
3 Persons in Household
4 Persons in Household
5 Persons in Household
Education - Some College
Education - College Degree
Education - Graduate Degree
Age
Age²
Intercept
McFadden's Pseudo R² / R²
-2 Log Likelihood
Percent Correctly Predicted (PCP)
Null Percent Correctly Predicted
N
Missing
Education
Missing
Age
-.06
(.10)
-1.15***
(.08)
Number
of Missing
Variables
-.59***
(.04)
.11
(.17)
.05
(.17)
-.08
(.15)
-.23
(.15)
-.34+
(.18)
-.26
(.17)
-.45
(.31)
-.07
(.19)
.27
(.18)
.35+
(.19)
.29+
(.16)
.32+
(.19)
.22
(.19)
.59*
(.26)
.28+
(.16)
.30+
(.16)
.09
(.15)
-.15
(.14)
-.26
(.17)
-.45
(.25)
-.02
(.28)
.05
(.08)
.15+
(.08)
.01
(.07)
-.08
(.07)
-.11
(.07)
-.15+
(.08)
-.02
(.13)
-.22
(.31)
.57+
(.30)
.24
(.32)
-.63
(.40)
-.30*
(.12)
-.43**
(.13)
-.49***
(.14)
-.98***
(.16)
.15
(.14)
-.12
(.15)
-.34*
(.16)
-.27
(.17)
-.19
(.13)
-.46**
(.14)
-.26+
(.15)
-.59***
(.15)
-.11+
(.06)
-.22***
(.06)
-.25***
(.07)
-.41***
(.07)
-.10
(.20)
-.17
(.24)
-.42***
(.10)
.02
(.12)
.10
(.10)
-.06
(.04)
.04
(.14)
-.13
(.19)
-.11
(.23)
-.39
(.46)
.02
(.22)
.08
(.24)
-.11
(.39)
.01
(.46)
-.05
(.35)
-.18
(.33)
.002
(.40)
.25
(.48)
-.08
(.11)
-.02
(.13)
-.15
(.17)
-.13
(.29)
.01
(.13)
-.09
(.12)
.10
(.21)
.04
(.30)
-.34**
(.10)
-.22+
(.11)
-.03
(.22)
-.08
(.24)
-.06
(.05)
-.07
(.06)
-.02
(.10)
-.04
(.11)
.01
(.02)
-3e-04+
(2e-04)
.02
(.02)
-4e-04
(3e-04)
-.005
(.02)
1e-04
(3e-04)
-.01
(.01)
1e-04
(1e-04)
-.004
(.01)
1e-04
(1e-04)
.03*
(.01)
-.001***
(1e-04)
.002
(.01)
-1e-04
(1e-04)
-.57
(.35)
-1.70***
(.50)
-3.36***
(.62)
-.27
(.28)
-1.63***
(.31)
.23
(.29)
1.65***
(.13)
.11
3027.4
.87
.87
4472
.09
1674.4
.95
.95
4472
.03
1328.2
.96
.96
4472
.05
4311.0
.80
.80
4472
.01
4149.6
.82
.82
4472
.11
4778.7
.74
.71
4472
.09
---4472
Table F3b – Pure Household Weight
Missing
Home
Ownership
Missing
Household
Income
Missing
Household
Size
Missing
Marital
Status
Missing
Education
Missing
Age
-1.30***
(.15)
-1.30***
(.22)
-.01
(.29)
-.53***
(.12)
-.02
(.13)
-1.07***
(.11)
Number
of Missing
Variables
(OLS)
-.57***
(.05)
.07
(.22)
.10
(.24)
-.36
(.24)
-.36
(.27)
-.57*
(.29)
-.70*
(.34)
-.27
(.39)
.08
(.33)
.52
(.34)
.06
(.32)
-.16
(.35)
-.12
(.46)
-.13
(.45)
.12
(.60)
-.56
(.74)
.18
(.51)
.33
(.47)
.27
(.51)
-.02
(.54)
.32
(.55)
.34
(.69)
.02
(.22)
-.08
(.21)
-.20
(.20)
-.22
(.18)
-.39+
(.21)
-.27
(.26)
-.54
(.37)
.01
(.26)
.36
(.28)
.31
(.21)
.28
(.20)
.28
(.25)
.08
(.23)
.65*
(.28)
.06
(.20)
.16
(.22)
-.14
(.23)
-.35
(.22)
-.43*
(.20)
-.49*
(.24)
-.20
(.27)
.01
(.08)
.11
(.08)
-.06
(.09)
-.11
(.09)
-.17+
(.09)
-.18+
(.10)
-.04
(.12)
-.14
(.19)
-.27
(.21)
-.45+
(.24)
-.42
(.26)
-.26
(.25)
-.46
(.28)
-.78*
(.34)
-1.64***
(.47)
-.34
(.35)
.39
(.36)
-.10
(.42)
-1.04+
(.60)
-.16
(.14)
-.32+
(.17)
-.48*
(.19)
-.87***
(.22)
.22
(.16)
-.005
(.18)
-.25
(.21)
-.04
(.22)
-.19
(.15)
-.46**
(.17)
-.45*
(.18)
-.52**
(.19)
-.08
(.05)
-.19**
(.06)
-.29***
(.07)
-.38***
(.07)
Married
-.22
(.17)
-.29
(.26)
-.03
(.29)
-.54***
(.13)
-.03
(.14)
.04
(.12)
-.11*
(.04)
Education - High School Degree
.01
(.18)
-.28
(.19)
-.19
(.29)
-.26
(.45)
.15
(.26)
-.01
(.26)
-.02
(.37)
.32
(.50)
.06
(.35)
-.21
(.36)
.10
(.51)
.26
(.78)
-.03
(.14)
-.12
(.15)
-.19
(.24)
-.03
(.34)
.12
(.14)
-.01
(.14)
.20
(.20)
.11
(.33)
-.36*
(.15)
-.26+
(.13)
-.03
(.22)
-.15
(.35)
-.03
(.05)
-.09+
(.05)
-.01
(.08)
.002
(.12)
.004
(.02)
-3e-04
(2e-04)
.02
(.03)
-.001
(4e-04)
-.01
(.04)
1e-04
(4e-04)
-.01
(.02)
2e-04
(2e-04)
.001
(.02)
4e-05
(2e-04)
.01
(.02)
-4e-04*
(2e-04)
-.01
(.01)
.00000
(1e-04)
Intercept
-.20
(.47)
-1.66*
(.68)
-3.09***
(.85)
-.08
(.38)
-1.90***
(.41)
.91*
(.38)
1.88***
(.13)
McFadden's Pseudo R² / R²
-2 Log Likelihood
Percent Correctly Predicted (PCP)
Null Percent Correctly Predicted
N
.14
851.2
.87
.87
4472
.14
524.4
.94
.94
4472
.04
331.9
.97
.97
4472
.06
1338.7
.77
.77
4472
.01
1197.6
.81
.81
4472
.13
1261.7
.74
.71
4472
.10
---4472
Self-Report Variable
Home Owner
Income - $15,000-24,999
Income - $25,000-34,999
Income - $35,000-49,999
Income - $50,000-74,999
Income - $75,000-99,999
Income - $100,000-149,999
Income - $150,000 or more
2 Persons in Household
3 Persons in Household
4 Persons in Household
5 Persons in Household
Education - Some College
Education - College Degree
Education - Graduate Degree
Age
Age²
Table F3c – Household Adult Weight
Missing
Home
Ownership
Missing
Household
Income
Missing
Household
Size
Missing
Marital
Status
Missing
Education
Missing
Age
-1.24***
(.15)
-1.27***
(.23)
.07
(.28)
-.46***
(.13)
.04
(.14)
-1.05***
(.11)
Number
of Missing
Variables
(OLS)
-.53***
(.05)
.14
(.23)
.17
(.25)
-.28
(.26)
-.28
(.28)
-.52+
(.29)
-.88*
(.37)
-.68
(.53)
.06
(.33)
.49
(.33)
-.01
(.37)
-.11
(.38)
-.15
(.41)
-.36
(.46)
-.60
(.70)
-.45
(.68)
.06
(.56)
.39
(.49)
.39
(.42)
-.05
(.57)
.39
(.48)
.44
(.62)
.11
(.20)
.09
(.20)
-.11
(.20)
-.13
(.21)
-.33
(.23)
-.33
(.27)
-.64+
(.36)
.05
(.24)
.21
(.23)
.25
(.23)
.09
(.23)
.16
(.25)
-.04
(.25)
.48
(.38)
.06
(.22)
.20
(.19)
-.12
(.20)
-.33
(.20)
-.44+
(.22)
-.47+
(.24)
-.20
(.28)
.04
(.08)
.14+
(.08)
-.04
(.08)
-.10
(.09)
-.17+
(.10)
-.21*
(.10)
-.11
(.10)
-.14
(.18)
-.24
(.21)
-.42+
(.24)
-.40
(.26)
-.29
(.24)
-.47+
(.28)
-.87*
(.34)
-1.66***
(.49)
-.31
(.35)
.46
(.35)
-.09
(.43)
-1.06+
(.62)
-.17
(.14)
-.28+
(.17)
-.50**
(.19)
-.85***
(.23)
.18
(.16)
-.04
(.19)
-.31
(.22)
-.11
(.23)
-.19
(.14)
-.43*
(.17)
-.45*
(.18)
-.42*
(.20)
-.08
(.05)
-.17**
(.06)
-.30***
(.07)
-.37***
(.08)
Married
-.24
(.16)
-.24
(.25)
-.13
(.28)
-.55***
(.14)
.03
(.15)
-.01
(.13)
-.11*
(.05)
Education - High School Degree
-.02
(.17)
-.23
(.22)
-.15
(.27)
.02
(.40)
.05
(.27)
.001
(.31)
-.01
(.38)
.42
(.59)
.02
(.32)
-.29
(.36)
.08
(.48)
.27
(.68)
-.07
(.14)
-.13
(.14)
-.16
(.21)
-.08
(.34)
.12
(.17)
-.004
(.17)
.24
(.21)
.13
(.35)
-.34*
(.16)
-.25
(.15)
-.02
(.20)
-.06
(.36)
-.05
(.05)
-.09
(.06)
.004
(.08)
.05
(.14)
-.02
(.02)
2e-05
(3e-04)
.01
(.04)
-3e-04
(4e-04)
-.03
(.04)
2e-04
(4e-04)
-.03+
(.02)
4e-04*
(2e-04)
.01
(.02)
-1e-05
(2e-04)
-.01
(.02)
-3e-04
(2e-04)
-.02*
(.01)
1e-04
(1e-04)
Intercept
.34
(.53)
-1.22
(.78)
-2.76**
(.97)
.30
(.42)
-1.96***
(.47)
1.18**
(.43)
2.10***
(.15)
McFadden's Pseudo R² / R²
-2 Log Likelihood
Percent Correctly Predicted (PCP)
Null Percent Correctly Predicted
N
.15
848.2
.87
.87
4134
.15
519.5
.94
.94
4134
.03
335.5
.97
.97
4134
.06
1345.2
.77
.77
4134
.01
1202.4
.81
.81
4134
.14
1268.1
.74
.72
4134
.10
---4134
Self-Report Variable
Home Owner
Income - $15,000-24,999
Income - $25,000-34,999
Income - $35,000-49,999
Income - $50,000-74,999
Income - $75,000-99,999
Income - $100,000-149,999
Income - $150,000 or more
2 Persons in Household
3 Persons in Household
4 Persons in Household
5 Persons in Household
Education - Some College
Education - College Degree
Education - Graduate Degree
Age
Age²
Table F3d – Best Respondent Match Weight
Missing
Education
Missing
Age
-.02
(.15)
-.96***
(.11)
Number
of Missing
Variables
-.52***
(.05)
.004
(.19)
-.11
(.21)
-.24
(.20)
-.30+
(.18)
-.44+
(.23)
-.38
(.23)
-.51+
(.30)
.10
(.31)
.39
(.32)
.42+
(.21)
.29
(.24)
.36
(.32)
.27
(.28)
.64+
(.34)
.13
(.19)
.17
(.23)
-.09
(.21)
-.32+
(.18)
-.48*
(.22)
-.51*
(.23)
-.19
(.28)
.03
(.10)
.10
(.10)
-.05
(.08)
-.14+
(.08)
-.18*
(.09)
-.18*
(.09)
-.08
(.12)
-.28
(.35)
.38
(.36)
.10
(.42)
-.89
(.58)
-.14
(.14)
-.30+
(.17)
-.43*
(.19)
-.82***
(.23)
.21
(.16)
-.04
(.18)
-.27
(.21)
-.07
(.22)
-.19
(.15)
-.48**
(.17)
-.46*
(.18)
-.56**
(.20)
-.07
(.06)
-.18*
(.07)
-.28***
(.08)
-.38***
(.08)
-.20
(.25)
.02
(.30)
-.56***
(.15)
-.01
(.13)
.04
(.13)
-.10+
(.05)
.05
(.20)
-.17
(.18)
-.05
(.31)
-.07
(.48)
.08
(.27)
-.05
(.28)
.10
(.36)
.34
(.56)
-.03
(.36)
-.30
(.37)
.17
(.43)
.04
(.92)
-.07
(.14)
-.15
(.14)
-.12
(.21)
-.09
(.35)
.06
(.17)
-.06
(.18)
.16
(.20)
.12
(.33)
-.44**
(.14)
-.34*
(.14)
-.12
(.22)
-.04
(.32)
-.06
(.06)
-.11+
(.06)
-.003
(.09)
.02
(.15)
-.03
(.02)
.001
(.003)
-.01
(.03)
-.001
(.004)
-.05
(.04)
.001
(.004)
-.03
(.02)
.004*
(.002)
.005
(.02)
.0000
(.002)
-.08***
(.02)
.004*
(.002)
-.03***
(.01)
.003***
(.001)
.57
(.49)
-.78
(.69)
-2.08*
(.88)
.19
(.40)
-1.99***
(.45)
2.77***
(.40)
2.49***
(.17)
.16
1137.1
.87
.87
3199
.16
588.1
.94
.94
3199
.03
585.9
.96
.96
3199
.06
1970.3
.77
.77
3199
.02
1850.9
.81
.81
3199
.13
1712.8
.75
.71
3199
.12
---3199
Missing
Home
Ownership
-1.25***
(.15)
Missing
Household
Income
-1.28***
(.22)
Missing
Household
Size
-.04
(.33)
Missing
Marital
Status
-.47***
(.12)
.13
(.26)
.10
(.25)
-.34
(.25)
-.44+
(.26)
-.55+
(.30)
-.77+
(.39)
-.62
(.48)
.07
(.35)
.52
(.33)
.02
(.31)
-.22
(.38)
-.20
(.38)
-.31
(.64)
-.18
(.69)
-.64
(.70)
-.10
(.57)
.17
(.45)
.02
(.53)
-.27
(.74)
.11
(.65)
.22
(.84)
-.10
(.19)
-.24
(.21)
-.41+
(.23)
-.38
(.25)
-.28
(.24)
-.48+
(.28)
-.81*
(.33)
-1.68***
(.47)
Married
-.22
(.18)
Education - High School Degree
Self-Report Variable
Home Owner
Income - $15,000-24,999
Income - $25,000-34,999
Income - $35,000-49,999
Income - $50,000-74,999
Income - $75,000-99,999
Income - $100,000-149,999
Income - $150,000 or more
2 Persons in Household
3 Persons in Household
4 Persons in Household
5 Persons in Household
Education - Some College
Education - College Degree
Education - Graduate Degree
Age
Age²
Intercept
McFadden's Pseudo R² / R²
-2 Log Likelihood
Percent Correctly Predicted (PCP)
Null Percent Correctly Predicted
N
Table F3e – Best Household Match Weight Generous
Missing
Home
Ownership
Missing
Household
Income
Missing
Household
Size
Missing
Marital
Status
Missing
Education
Missing
Age
-1.00***
(.21)
-1.03***
(.22)
.33
(.35)
-.51***
(.15)
.05
(.16)
-.99***
(.20)
Number
of Missing
Variables
(OLS)
-.55***
(.09)
.19
(.26)
.17
(.26)
-.16
(.28)
-.11
(.29)
-.24
(.33)
-.53
(.42)
-.16
(.50)
.04
(.38)
.45
(.30)
-.01
(.36)
.14
(.33)
.17
(.42)
-.15
(.50)
.01
(.65)
-.07
(.93)
.33
(.67)
.66
(.85)
.09
(1.03)
-.06
(.87)
-.10
(.86)
.47
(.88)
.19
(.26)
.16
(.29)
-.11
(.23)
-.01
(.30)
-.19
(.30)
-.24
(.34)
-.19
(.57)
-.08
(.35)
.36
(.28)
.65*
(.28)
.58*
(.27)
.55+
(.32)
.36
(.35)
.92*
(.44)
.49
(.32)
.30
(.34)
-.04
(.29)
-.13
(.27)
-.38
(.30)
-.38
(.31)
-.05
(.46)
.15
(.15)
.24*
(.12)
.09
(.11)
.06
(.13)
-.04
(.14)
-.14
(.15)
.14
(.22)
-.10
(.22)
-.11
(.25)
-.47+
(.28)
-.43
(.32)
-.20
(.26)
-.19
(.29)
-.63+
(.34)
-1.29**
(.50)
-.45
(.43)
-.10
(.47)
-.14
(.52)
-1.99+
(1.08)
-.18
(.19)
-.35
(.23)
-.43+
(.25)
-.92**
(.31)
.11
(.20)
.15
(.24)
-.15
(.27)
-.08
(.30)
-.02
(.20)
.07
(.23)
-.11
(.25)
.18
(.29)
-.08
(.10)
-.08
(.11)
-.29*
(.12)
-.37**
(.14)
Married
-.19
(.21)
-.24
(.27)
-.20
(.41)
-.17
(.19)
.04
(.19)
-.02
(.17)
-.08
(.09)
Education - High School Degree
.14
(.22)
-.24
(.23)
-.16
(.32)
.30
(.44)
.19
(.28)
.08
(.27)
.03
(.44)
.57
(.52)
.12
(.47)
-.37
(.49)
.30
(.55)
-.30
(1.25)
-.12
(.19)
-.30
(.21)
-.45
(.30)
-.03
(.52)
-.06
(.21)
-.19
(.22)
-.08
(.33)
-.26
(.53)
-.47*
(.21)
-.37+
(.21)
-.19
(.28)
.10
(.44)
-.06
(.12)
-.19+
(.10)
-.13
(.14)
.09
(.21)
-.02
(.03)
4e-05
(3e-04)
-.003
(.03)
-2e-04
(4e-04)
-.02
(.05)
1e-04
(.001)
-.02
(.02)
3e-04
(2e-04)
.003
(.02)
2e-05
(2e-04)
-.11***
(.03)
.001**
(3e-04)
-.03**
(.01)
2e-04
(1e-04)
Intercept
.46
(.56)
-.87
(.69)
-2.49*
(1.21)
.28
(.50)
-1.75**
(.55)
4.57***
(.62)
2.87***
(.23)
McFadden's Pseudo R² / R²
-2 Log Likelihood
Percent Correctly Predicted (PCP)
Null Percent Correctly Predicted
N
.11
584.2
.78
.77
2010
.10
431.8
.87
.87
2010
.08
172.8
.96
.96
2010
.05
714.8
.71
.72
2010
.02
651.4
.76
.76
2010
.16
685.4
.72
.61
2010
.11
---2010
Self-Report Variable
Home Owner
Income - $15,000-24,999
Income - $25,000-34,999
Income - $35,000-49,999
Income - $50,000-74,999
Income - $75,000-99,999
Income - $100,000-149,999
Income - $150,000 or more
2 Persons in Household
3 Persons in Household
4 Persons in Household
5 Persons in Household
Education - Some College
Education - College Degree
Education - Graduate Degree
Age
Age²
Table F3f – Best Household Match Weight Strict
Missing
Home
Ownership
Missing
Household
Income
Missing
Household
Size
Missing
Marital
Status
Missing
Education
Missing
Age
Home Owner
-1.16**
(.33)
-.93+
(.49)
.18
(.43)
-.29
(.24)
.10
(.27)
-.95**
(.34)
Number
of Missing
Variables
(OLS)
-.31**
(.09)
Income - $15,000-24,999
.08
(.51)
-.23
(.51)
-.17
(.48)
-.29
(.43)
-.60
(.91)
-.68
(.57)
-3.14
(361.63)
-.13
(.70)
-.10
(.61)
-.29
(.54)
.08
(.67)
-.66
(1.12)
-.67
(.82)
-3.81
(1541.83)
-.55
(.86)
-.44
(.94)
.36
(.62)
-.82
(1.10)
.01
(.76)
-.57
(.94)
-.19
(1.38)
.07
(.37)
.02
(.35)
-.35
(.38)
-.08
(.31)
-.30
(.43)
-.09
(.52)
-.40
(.94)
-.06
(.52)
.07
(.53)
.50
(.44)
.41
(.47)
.53
(.47)
.10
(.53)
.63
(.60)
-.05
(.56)
-.41
(.53)
-.61
(.46)
-.39
(.62)
-.80
(.69)
-.67
(.57)
-.98
(.98)
.01
(.17)
-.06
(.18)
-.06
(.14)
-.05
(.17)
-.08
(.19)
-.14
(.15)
-.03
(.26)
-1.61***
(.43)
-1.84***
(.48)
-2.26***
(.63)
-.76
(.49)
-18.76
(1347.02)
-19.16
(1986.46)
-19.10
(2046.11)
-19.16
(2465.22)
-.38
(.50)
.51
(.48)
-.01
(.60)
-.72
(.83)
-1.08***
(.26)
-.89**
(.31)
-1.31***
(.38)
-1.81***
(.51)
-.28
(.25)
-.29
(.30)
-1.66***
(.46)
-.92*
(.43)
-20.80
(996.14)
-21.29
(1484.59)
-21.21
(1509.35)
-21.49
(1818.57)
-1.05***
(.10)
-1.09***
(.11)
-1.29***
(.12)
-1.24***
(.14)
Married
-.40
(.42)
-.004
(.62)
-.40
(.52)
-.53+
(.28)
.08
(.26)
-.15
(.52)
-.05
(.10)
Education - High School Degree
.41
(.39)
-.31
(.46)
-.11
(.49)
.23
(.90)
.12
(.54)
-.13
(.61)
-.05
(.73)
.58
(.91)
.06
(.54)
-.40
(.65)
-.35
(.88)
-5.76
(511.67)
-.12
(.31)
-.47
(.34)
-.62+
(.37)
-.14
(.64)
.19
(.32)
.15
(.27)
.24
(.36)
.20
(.70)
-.58
(.52)
-.36
(.50)
.08
(.53)
-.22
(.95)
.02
(.11)
-.10
(.10)
-.04
(.14)
.05
(.22)
-.06
(.04)
3e-04
(5e-04)
.07
(.07)
-.001
(.001)
-.10+
(.05)
.001+
(.001)
-.03
(.03)
4e-04
(3e-04)
.09**
(.04)
-.001*
(4e-04)
-.12*
(.05)
.001
(.001)
-.01
(.01)
4e-05
(1e-04)
Intercept
1.74+
(.99)
-2.09
(1.52)
-.27
(1.29)
.45
(.78)
-4.26***
(.95)
5.48***
(1.41)
2.20***
(.30)
McFadden's Pseudo R² / R²
-2 Log Likelihood
Percent Correctly Predicted (PCP)
Null Percent Correctly Predicted
N
.27
480.7
.89
.88
920
.35
254.2
.94
.94
920
.07
303.6
.96
.96
920
.12
781.4
.80
.80
920
.06
806.1
.81
.81
920
.61
355.5
.90
.80
920
.32
---920
Self-Report Variable
Income - $25,000-34,999
Income - $35,000-49,999
Income - $50,000-74,999
Income - $75,000-99,999
Income - $100,000-149,999
Income - $150,000 or more
2 Persons in Household
3 Persons in Household
4 Persons in Household
5 Persons in Household
Education - Some College
Education - College Degree
Education - Graduate Degree
Age
Age²
Table F3g – Sole Household Match Weight
Missing
Home
Ownership
Missing
Household
Income
Missing
Household
Size
Missing
Marital
Status
Missing
Education
Missing
Age
-1.22***
(.32)
-.99*
(.43)
-.003
(.62)
-.31
(.28)
.02
(.31)
-1.04***
(.29)
Number
of Missing
Variables
(OLS)
-.40**
(.12)
.18
(.56)
.08
(.53)
-.23
(.52)
.17
(.59)
-.31
(.80)
-.54
(.89)
.11
(1.37)
-.10
(.65)
.15
(.63)
-.24
(.62)
.24
(.71)
-.48
(1.32)
-.14
(1.00)
-.04
(1.39)
-.98
(1.08)
-.20
(.92)
.01
(.91)
-.63
(1.06)
-3.57
(347.76)
-3.96
(579.55)
-12.38
(1162.99)
-.02
(.40)
.02
(.43)
-.38
(.41)
.11
(.43)
-.18
(.44)
-.16
(.41)
.19
(.78)
.26
(.53)
.39
(.59)
.73
(.44)
.74+
(.42)
.62
(.54)
.24
(.58)
.65
(.65)
.28
(.61)
-.22
(.57)
-.52
(.49)
-.34
(.52)
-.83
(.64)
-.47
(.64)
-.78
(1.61)
.09
(.23)
.07
(.18)
-.04
(.18)
.08
(.19)
-.07
(.19)
-.10
(.18)
.04
(.32)
-1.73***
(.52)
-1.98***
(.57)
-2.66*
(1.08)
-.13
(.61)
-18.31
(1282.61)
-18.88
(1847.27)
-18.66
(2196.71)
-18.58
(2820.00)
-.70
(.64)
.30
(.54)
-.08
(.75)
-.19
(.86)
-.90**
(.27)
-.97**
(.33)
-1.65**
(.51)
-1.63*
(.65)
-.35
(.29)
-.31
(.34)
-1.57**
(.58)
-1.64*
(.77)
-20.72
(1243.63)
-21.32
(1789.17)
-21.11
(2099.07)
-21.51
(2628.75)
-1.00***
(.11)
-1.11***
(.13)
-1.29***
(.16)
-1.15***
(.19)
Married
-.60
(.59)
-.19
(.63)
.35
(.56)
-.34
(.30)
.12
(.34)
-.07
(.44)
-.04
(.12)
Education - High School Degree
.44
(.39)
-.40
(.41)
-.23
(.50)
.52
(.88)
.14
(.50)
-.38
(.67)
-.10
(.67)
.95
(.93)
.53
(.91)
.02
(.82)
.42
(1.13)
-2.74
(839.54)
-.03
(.32)
-.40
(.40)
-.37
(.39)
.39
(.86)
.37
(.37)
.21
(.42)
.11
(.46)
.29
(.75)
-.27
(.46)
-.37
(.44)
-.01
(.64)
.18
(.80)
.10
(.14)
-.12
(.15)
-.05
(.20)
.31
(.33)
-.04
(.05)
2e-05
(.001)
.07
(.07)
-.001
(.001)
-.15*
(.06)
.001*
(.001)
-.03
(.03)
4e-04
(3e-04)
.09*
(.04)
-.001+
(4e-04)
-.12*
(.05)
.001
(.001)
-.02
(.01)
1e-04
(1e-04)
Intercept
1.28
(1.08)
-2.14
(1.50)
.63
(1.40)
.62
(.82)
-4.64***
(1.10)
5.27***
(1.40)
2.43***
(.36)
McFadden's Pseudo R² / R²
-2 Log Likelihood
Percent Correctly Predicted (PCP)
Null Percent Correctly Predicted
N
.27
404.5
.87
.86
672
.30
254.5
.92
.92
672
.11
224.2
.95
.95
672
.09
674.3
.76
.76
672
.07
605.3
.81
.81
672
.55
355.8
.87
.72
672
.32
---672
Self-Report Variable
Home Owner
Income - $15,000-24,999
Income - $25,000-34,999
Income - $35,000-49,999
Income - $50,000-74,999
Income - $75,000-99,999
Income - $100,000-149,999
Income - $150,000 or more
2 Persons in Household
3 Persons in Household
4 Persons in Household
5 Persons in Household
Education - Some College
Education - College Degree
Education - Graduate Degree
Age
Age²
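The fit statistics named in the Table F3 panels (McFadden's pseudo R², -2 log likelihood, percent correctly predicted, and the null percent correctly predicted) can be illustrated with a minimal Python sketch. This is an unweighted illustration under stated assumptions, not the authors' procedure: the function names `log_likelihood` and `fit_statistics` are hypothetical, the predicted probabilities are taken as given, the null model is assumed to be intercept-only, and a case is counted as predicted missing when its probability is at least .5.

```python
import math
from typing import Dict, Sequence


def log_likelihood(y: Sequence[int], p: Sequence[float]) -> float:
    """Bernoulli log likelihood of 0/1 outcomes y given probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))


def fit_statistics(y: Sequence[int], p: Sequence[float]) -> Dict[str, float]:
    """Summarize the fit of a binary model (illustrative, unweighted)."""
    n = len(y)
    base_rate = sum(y) / n
    if not 0.0 < base_rate < 1.0:
        raise ValueError("outcome must contain both 0s and 1s")
    ll_model = log_likelihood(y, p)
    # Assumed null model: every case gets the overall share of 1s.
    ll_null = log_likelihood(y, [base_rate] * n)
    correct = sum((pi >= 0.5) == (yi == 1) for yi, pi in zip(y, p))
    return {
        "mcfadden_r2": 1.0 - ll_model / ll_null,
        "neg2_log_likelihood": -2.0 * ll_model,
        "pcp": correct / n,
        # Always guessing the modal outcome gives the null PCP.
        "null_pcp": max(base_rate, 1.0 - base_rate),
    }


# Hypothetical outcomes (1 = ancillary value missing) and predicted probabilities.
stats = fit_statistics([1, 0, 0, 0], [0.8, 0.1, 0.2, 0.6])
```

When the outcome is rare, the null percent correctly predicted is already high, so a model can have a non-trivial pseudo R² while its PCP barely exceeds the null value.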
Figure F1. Missing Ancillary Data by Variables and Respondents
Figure F1a – Unweighted, with Probability of Sampling Correction Only
a) Missing Ancillary Data by Variable (N = 4472). x-axis: Variable (Home Ownership, Household Income, Household Size, Marital Status, Education, Age); y-axis: Proportion Missing Ancillary Data (%).
b) Distribution of Missing Ancillary Data Across Respondents (N = 4472). x-axis: Number of Variables Missing (0 to 6); y-axis: Proportion of Respondents (%).
Figure F1b – Pure Household Weight
a) Missing Ancillary Data by Variable (N = 2498). x-axis: Variable (Home Ownership, Household Income, Household Size, Marital Status, Education, Age); y-axis: Proportion Missing Ancillary Data (%).
b) Distribution of Missing Ancillary Data Across Respondents (N = 2498). x-axis: Number of Variables Missing (0 to 6); y-axis: Proportion of Respondents (%).
Figure F1c – Household Adult Weight
a) Missing Ancillary Data by Variable (N = 2400). x-axis: Variable (Home Ownership, Household Income, Household Size, Marital Status, Education, Age); y-axis: Proportion Missing Ancillary Data (%).
b) Distribution of Missing Ancillary Data Across Respondents (N = 2400). x-axis: Number of Variables Missing (0 to 6); y-axis: Proportion of Respondents (%).
Figure F1d – Best Respondent Match Weight
a) Missing Ancillary Data by Variable (N = 2498). x-axis: Variable (Home Ownership, Household Income, Household Size, Marital Status, Education, Age); y-axis: Proportion Missing Ancillary Data (%).
b) Distribution of Missing Ancillary Data Across Respondents (N = 2498). x-axis: Number of Variables Missing (0 to 6); y-axis: Proportion of Respondents (%).
Figure F1e – Best Household Match Weight Generous
a) Missing Ancillary Data by Variable (N = 1166). x-axis: Variable (Home Ownership, Household Income, Household Size, Marital Status, Education, Age); y-axis: Proportion Missing Ancillary Data (%).
b) Distribution of Missing Ancillary Data Across Respondents (N = 1166). x-axis: Number of Variables Missing (0 to 6); y-axis: Proportion of Respondents (%).
Figure F1f – Best Household Match Weight Strict
a) Missing Ancillary Data by Variable (N = 909). x-axis: Variable (Home Ownership, Household Income, Household Size, Marital Status, Education, Age); y-axis: Proportion Missing Ancillary Data (%).
b) Distribution of Missing Ancillary Data Across Respondents (N = 909). x-axis: Number of Variables Missing (0 to 6); y-axis: Proportion of Respondents (%).
Figure F1g – Sole Household Match Weight
a) Missing Ancillary Data by Variable (N = 672). x-axis: Variable (Home Ownership, Household Income, Household Size, Marital Status, Education, Age); y-axis: Proportion Missing Ancillary Data (%).
b) Distribution of Missing Ancillary Data Across Respondents (N = 672). x-axis: Number of Variables Missing (0 to 6); y-axis: Proportion of Respondents (%).
Figure F2. Missing Ancillary Data by Self-Reported Value
Figure F2a – Unweighted, with Probability of Sampling Correction Only
a) Distribution of Missing Home Ownership Data by Self-Reported Home Ownership. χ²(1, 3369) = 501.7***. x-axis: Self-Reported Home Ownership (Non-Owner, Owner); y-axis: Proportion Missing Ancillary Home Ownership Data (%).
b) Distribution of Missing Household Income Data by Self-Reported Income Category. χ²(7, 2941) = 116.8***. x-axis: Self-Reported Income Category (0-15, 15-25, 25-35, 35-50, 50-75, 75-100, 100-150, 150+); y-axis: Proportion Missing Ancillary Household Income Data (%).
c) Distribution of Missing Household Size Data by Self-Reported Household Size. χ²(4, 4464) = 28.1***. x-axis: Self-Reported Household Size (1, 2, 3, 4, 5+); y-axis: Proportion Missing Ancillary Household Size Data (%).
d) Distribution of Missing Marital Status Data by Self-Reported Marital Status. χ²(1, 2661) = 135.1***. x-axis: Self-Reported Marital Status (Not Married, Married); y-axis: Proportion Missing Ancillary Marital Status Data (%).
e) Distribution of Missing Education Data by Self-Reported Education Category. χ²(4, 2661) = 3.1. x-axis: Self-Reported Education Category (Less Than High School, High School Graduate, Some College, College Degree, Post-Graduate Education); y-axis: Proportion Missing Ancillary Education Data (%).
f) Distribution of Missing Age Data by Self-Reported Age Category. χ²(6, 4136) = 281.1***. x-axis: Self-Reported Age Category (18-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75+); y-axis: Proportion Missing Ancillary Age Data (%).
Figure F2b – Pure Household Weight
[Figure: six bar-chart panels. Y-axis in each panel: Proportion Missing Ancillary Data (%) for the named variable. X-axis: the self-reported category.]
a) Distribution of Missing Home Ownership Data by Self-Reported Home Ownership (Non-Owner, Owner): χ²(1, 1873) = 274.2***
b) Distribution of Missing Household Income Data by Self-Reported Income Category (0–15, 15–25, 25–35, 35–50, 50–75, 75–100, 100–150, 150+): χ²(7, 1648) = 56.7***
c) Distribution of Missing Household Size Data by Self-Reported Household Size (1, 2, 3, 4, 5+): χ²(4, 2494) = 11.6*
d) Distribution of Missing Marital Status Data by Self-Reported Marital Status (Not Married, Married): χ²(1, 1614) = 99.9***
e) Distribution of Missing Education Data by Self-Reported Education Category (Less Than High School, High School Graduate, Some College, College Degree, Post-Graduate Education): χ²(4, 1614) = 2.3
f) Distribution of Missing Age Data by Self-Reported Age Category (18–24, 25–34, 35–44, 45–54, 55–64, 65–74, 75+): χ²(6, 2395) = 201.0***
Figure F2c – Household Adult Weight
[Figure: six bar-chart panels. Y-axis in each panel: Proportion Missing Ancillary Data (%) for the named variable. X-axis: the self-reported category.]
a) Distribution of Missing Home Ownership Data by Self-Reported Home Ownership (Non-Owner, Owner): χ²(1, 1801) = 261.1***
b) Distribution of Missing Household Income Data by Self-Reported Income Category (0–15, 15–25, 25–35, 35–50, 50–75, 75–100, 100–150, 150+): χ²(7, 1584) = 55.6***
c) Distribution of Missing Household Size Data by Self-Reported Household Size (1, 2, 3, 4, 5+): χ²(4, 2396) = 11.9*
d) Distribution of Missing Marital Status Data by Self-Reported Marital Status (Not Married, Married): χ²(1, 1570) = 107.8***
e) Distribution of Missing Education Data by Self-Reported Education Category (Less Than High School, High School Graduate, Some College, College Degree, Post-Graduate Education): χ²(4, 1570) = 2.6
f) Distribution of Missing Age Data by Self-Reported Age Category (18–24, 25–34, 35–44, 45–54, 55–64, 65–74, 75+): χ²(6, 2396) = 201.6***
Figure F2d – Best Respondent Match Weight
[Figure: six bar-chart panels. Y-axis in each panel: Proportion Missing Ancillary Data (%) for the named variable. X-axis: the self-reported category.]
a) Distribution of Missing Home Ownership Data by Self-Reported Home Ownership (Non-Owner, Owner): χ²(1, 1872) = 273.8***
b) Distribution of Missing Household Income Data by Self-Reported Income Category (0–15, 15–25, 25–35, 35–50, 50–75, 75–100, 100–150, 150+): χ²(7, 1648) = 56.7***
c) Distribution of Missing Household Size Data by Self-Reported Household Size (1, 2, 3, 4, 5+): χ²(4, 2494) = 11.7*
d) Distribution of Missing Marital Status Data by Self-Reported Marital Status (Not Married, Married): χ²(1, 1632) = 118.4***
e) Distribution of Missing Education Data by Self-Reported Education Category (Less Than High School, High School Graduate, Some College, College Degree, Post-Graduate Education): χ²(4, 1632) = 3.9
f) Distribution of Missing Age Data by Self-Reported Age Category (18–24, 25–34, 35–44, 45–54, 55–64, 65–74, 75+): χ²(6, 2458) = 292.2***
Figure F2e – Best Household Match Weight Generous
[Figure: six bar-chart panels. Y-axis in each panel: Proportion Missing Ancillary Data (%) for the named variable. X-axis: the self-reported category.]
a) Distribution of Missing Home Ownership Data by Self-Reported Home Ownership (Non-Owner, Owner): χ²(1, 852) = 119.4***
b) Distribution of Missing Household Income Data by Self-Reported Income Category (0–15, 15–25, 25–35, 35–50, 50–75, 75–100, 100–150, 150+): χ²(7, 743) = 27.1***
c) Distribution of Missing Household Size Data by Self-Reported Household Size (1, 2, 3, 4, 5+): χ²(4, 1163) = 5.5
d) Distribution of Missing Marital Status Data by Self-Reported Marital Status (Not Married, Married): χ²(1, 740) = 24.0***
e) Distribution of Missing Education Data by Self-Reported Education Category (Less Than High School, High School Graduate, Some College, College Degree, Post-Graduate Education): χ²(4, 740) = 1.1
f) Distribution of Missing Age Data by Self-Reported Age Category (18–24, 25–34, 35–44, 45–54, 55–64, 65–74, 75+): χ²(6, 1132) = 153.1***
Figure F2f – Best Household Match Weight Strict
[Figure: six bar-chart panels. Y-axis in each panel: Proportion Missing Ancillary Data (%) for the named variable. X-axis: the self-reported category.]
a) Distribution of Missing Home Ownership Data by Self-Reported Home Ownership (Non-Owner, Owner): χ²(1, 642) = 109.5***
b) Distribution of Missing Household Income Data by Self-Reported Income Category (0–15, 15–25, 25–35, 35–50, 50–75, 75–100, 100–150, 150+): χ²(7, 586) = 22.8**
c) Distribution of Missing Household Size Data by Self-Reported Household Size (1, 2, 3, 4, 5+): χ²(4, 904) = 8.3+
d) Distribution of Missing Marital Status Data by Self-Reported Marital Status (Not Married, Married): χ²(1, 535) = 44.7***
e) Distribution of Missing Education Data by Self-Reported Education Category (Less Than High School, High School Graduate, Some College, College Degree, Post-Graduate Education): χ²(4, 535) = 2.6
f) Distribution of Missing Age Data by Self-Reported Age Category (18–24, 25–34, 35–44, 45–54, 55–64, 65–74, 75+): χ²(6, 904) = 49.2***
Figure F2g – Sole Household Match Weight
[Figure: six bar-chart panels. Y-axis in each panel: Proportion Missing Ancillary Data (%) for the named variable. X-axis: the self-reported category.]
a) Distribution of Missing Home Ownership Data by Self-Reported Home Ownership (Non-Owner, Owner): χ²(1, 405) = 78.5***
b) Distribution of Missing Household Income Data by Self-Reported Income Category (0–15, 15–25, 25–35, 35–50, 50–75, 75–100, 100–150, 150+): χ²(7, 380) = 13.7+
c) Distribution of Missing Household Size Data by Self-Reported Household Size (1, 2, 3, 4, 5+): χ²(4, 667) = 7.8
d) Distribution of Missing Marital Status Data by Self-Reported Marital Status (Not Married, Married): χ²(1, 296) = 11.0***
e) Distribution of Missing Education Data by Self-Reported Education Category (Less Than High School, High School Graduate, Some College, College Degree, Post-Graduate Education): χ²(4, 296) = 1.0
f) Distribution of Missing Age Data by Self-Reported Age Category (18–24, 25–34, 35–44, 45–54, 55–64, 65–74, 75+): χ²(6, 667) = 59.3***