ACCURACY & COMPLETENESS IN CONSUMER FILE DATA

Can Marketing Data Aid Survey Research? Examining Accuracy and Completeness in Consumer File Data

Josh Pasek*
University of Michigan, 105 S. State Street, 5413 North Quad, Ann Arbor, MI, USA 48109. Phone: +1-734-764-6717. Email: jpasek@umich.edu.

S. Mo Jang
University of South Carolina, 600 Assembly St., Carolina Coliseum RM4011, Columbia, SC, USA 29201. Phone: +1-858-775-4978. Email: mo7788@gmail.com.

Curtiss L. Cobb III
GfK Custom Research, LLC and Facebook, 1 Hacker Way, Menlo Park, CA, USA 94025. Phone: +1-559-284-0866. Email: ccobb@fb.com.

J. Michael Dennis
GfK Custom Research, LLC, 2100 Geng Road, Suite 210, Palo Alto, CA, USA 94303. Phone: +1-650-288-1930. Email: jmdstat@yahoo.com.

Charles DiSogra
Abt SRBI, 275 Seventh Avenue, Suite 2700, New York, NY, USA 10001. Phone: +1-617-3864070. Email: C.DiSogra@srbi.com.

RUNNING HEADER: ACCURACY & COMPLETENESS IN CONSUMER FILE DATA

6,751 words
3 Tables
2 Figures

JOSH PASEK is assistant professor of communication studies and faculty associate, Center for Political Studies, Institute for Social Research, at the University of Michigan, Ann Arbor, MI, USA. S. MO JANG is assistant professor in the School of Journalism and Mass Communications, at the University of South Carolina, Columbia, SC, USA. CURTISS L. COBB III was senior director of survey methodology at GfK Custom Research at the time research was conducted and is currently research scientist at Facebook, Menlo Park, CA, USA. J. MICHAEL DENNIS was managing director of government and academic research at GfK Custom Research, Palo Alto, CA, USA. CHARLES DISOGRA is chief survey scientist at Abt SRBI, New York, NY, USA. Data for the study were provided by GfK Custom Research, LLC. *Address correspondence to Josh Pasek, University of Michigan, 105 S. State Street, 5413 North Quad, Ann Arbor, MI, USA 48109; email: jpasek@umich.edu.

Abstract

Survey research depends crucially on its ability to collect data from a targeted sample and for that sample to mirror the population of interest. Increasingly, survey firms are using data purchased from marketing firms such as Experian and Acxiom (consumer file marketing data) as a means to improve correspondence between survey respondents and the general public. These data hold tremendous promise, not only for sampling at a reduced cost, but also for allowing researchers to adjust biases that often occur across groups in traditional survey research. Though these new techniques are gaining momentum and currency, there is to date no published research comparing marketing data to more traditionally sampled data. The benefits from using marketing data depend in part on whether the data are both accurate and complete. This paper is the first to systematically assess the quality of one source of consumer file marketing data. Using a unique dataset compiled by GfK KnowledgePanel®, we compare this source of ancillary marketing data with self-report data on the same respondents to ask how frequently the two correspond. We also evaluate conditions under which consumer file data are missing to determine whether patterns in missing data might introduce systematic biases when data are analyzed. Results indicate that the ancillary data differ from self-reported data on a variety of demographic factors. Further, data were missing in patterns that could not be easily addressed. The findings urge caution for those who hope to improve survey administration and design using currently available consumer file data.

Survey research depends crucially on its ability to collect data from a targeted sample and for that sample to mirror the population of interest.
To achieve these goals, survey methodologists are constantly comparing self-report survey data with data from other sources – known as auxiliary data – as a way both to improve data collection and to adjust for survey nonresponse (Deville, Sarndal, and Sautory 1993; Smith 2011). Since the 1950s, for example, surveys have been compared to benchmarking studies, such as the Current Population Survey, to determine which groups of people may be more or less responsive when sampled (cf. Kessler and Little 1995). Researchers have also explored how paradata (information gathered in the process of survey administration) and linked administrative data (from official sources) might reveal patterns about whether and how people respond to particular kinds of surveys (Calderwood and Lessof 2009; Couper and Lyberg 2005; Sakshaug and Kreuter 2012). As auxiliary data sources, benchmarking, paradata, and administrative records can then be used to adjust for differences in how people respond across groups and to improve administration and targeting during the survey process (Kreuter 2013; Smith 2011). These sources are limited, however, in that adjustments can only be made after households have been sampled for a study, but not prior to the sampling process.[1]

[1] In the case of benchmarks, this is true because adjustments need to be made on aggregate totals or “marginals” rather than individual-level data. In the case of paradata, this is the case because information is only generated in the process of sampling. Finally, in the case of administrative data, respondent consent is often required for linkage (cf. Sakshaug and Kreuter 2012).

Recently, an additional type of auxiliary data has surfaced: researchers are purchasing (usually through sample vendors) information from marketing companies such as Experian and Acxiom – so-called “consumer file data” – which they are appending to survey samples. We refer to this particular type of data as ancillary data. Unlike both paradata and benchmark data, consumer file ancillary data can improve survey design even before any responses have been collected. This is because the consumer file data can be purchased in advance of data collection. Consumer file databases also include considerable information about American households ranging from demographic characteristics to partisanship. As an ancillary data source, consumer file marketing data could allow researchers to conduct targeted sampling, to adjust for differences between respondents and the full set of sampled households (i.e. including nonrespondents), and even to generalize from nonprobability samples to the public (Smith 2011). These potential advantages, however, depend in part on our ability to trust the quality of information provided in the marketing databases.

The current study represents a first test in evaluating one set of ancillary measures in terms of both their accuracy and completeness as an early-stage inquiry into the potential of consumer file data to enrich our survey toolkit for sampling and weighting. Using a unique dataset that combines survey data sourced from an Address Based Sample (ABS) with consumer file data from a well-regarded commercial source, we examine whether data derived from both survey self-reports and an ancillary consumer file dataset lead to similar conclusions about the households and individuals that respond. Our ability to use these data depends on whether ancillary data values 1) correspond with survey responses (a proxy for accuracy), and 2) are not missing information in ways that could undermine either the sample design applications or the conclusions derived from the data (a proxy for completeness).
Understanding Consumer File Data

Smith (2011, 393) notes that, “many databases are ‘black boxes’ that do not disclose how they are constructed and what rules are followed.” The provider Experian® (2013), for example, reports that the data come “from more than 3,500 original public and proprietary sources” and that the data provided “is tested and validated”. Ironically, they call this process “black box analytics” (Tewksbury and Roy 2012). InfoUSA (2013) indicates the sources of some data, stating that they come from places as diverse as “real estate and tax assessments,” and “voter registration files,” but many important facets of the final data – such as specific sources for the variables – are obscured on proprietary grounds. The firms do not provide information on how many sources of data were aggregated, when the data were obtained, how discrepancies were identified and prioritized, how data sources were linked to one another, what modes of data collection were used, and the extent to which the data presented represent inferences rather than observations. All this challenges researchers’ ability to evaluate accuracy and completeness. This is a fast-evolving area where customer requests from the public opinion research community can play a useful role in encouraging commercial enterprises to provide more transparency into the sources and construction of their consumer file data.

Consumer File Data Quality

Studies assessing other forms of auxiliary data for survey administration have highlighted the importance of both accuracy and completeness. These concerns have been raised most notably in comparisons between self-reported questionnaires and official records. A number of recent studies have identified discrepancies when linking survey results with both health records (Davern et al. 2008; Fowles, Fowler, and Craft 1998; Hebert et al. 1999) and official voter statistics (Berent, Lupia, and Krosnick 2011).
Researchers initially presumed that such discrepancies were a product of self-report errors. Emerging evidence, however, indicates that incorrect official records are sometimes to blame (Hebert et al. 1999; Berent, Lupia, and Krosnick 2011). Additionally, linking surveyed individuals with official records can introduce sources of bias (Antoni 2011). Given that even official records can introduce errors, it should come as no surprise that less carefully collected data might introduce problems, as has been noted in some studies of paradata (West 2013; Sinibaldi, Durrant, and Kreuter 2013). These sorts of errors can massively complicate the goal of using auxiliary sources to improve survey sampling (cf. West and Little 2013), especially if, as Sinibaldi et al. (2013) found, there are differences in accuracy between survey respondents and nonrespondents. Hence, although there are huge differences between different types of auxiliary data, their uses commonly depend on their ability to match survey results and on consistent coverage of the population.

There are a number of reasons to worry that ancillary consumer file data may suffer from particular limitations in accuracy, completeness, and also currency. Inaccurate data might emerge 1) because information is out of date (e.g. the residents in a household have changed or the data describe someone’s past situation, but not their current status), 2) because marketing data were linked to the wrong individual or household (see Winkler 2006; Yancey 2010), 3) if individuals provided data to marketing companies that were untrue (e.g. filling out a warranty form under an assumed name), 4) if the data were inferred from other information, but happened to be inaccurate (e.g. presuming that anyone who buys diapers is a parent or that anyone who lives in a highly educated neighborhood is highly educated), or 5) if there is an error in reconciling conflicting data from two or more data sources. Mismatches between ancillary consumer file data and self-report information emerged in the one earlier examination of this type (DiSogra, Dennis, and Fahimi 2010), but results have not yet appeared in the peer-reviewed literature. DiSogra et al. (2010) found evidence of inconsistent accuracy across variables, indicating the potential for serious inferential problems when both sorts of data are used together.[2]

[2] A few additional conference papers have examined the use of marketing data for survey purposes, but have not focused on the key issues of accuracy and completeness addressed here (e.g. Barron et al. 2012; Li et al. 2013; Srinath, Battaglia, and Khare 2004).

Incomplete consumer file data could result either from people who fail to provide the kinds of information that marketing companies use (e.g. not filling out warranty forms, registering to vote, deeding property, or borrowing on credit), or from an inability to confidently determine which piece of consumer file data should be linked to a particular household or individual. Although DiSogra et al. (2010) noted that a large proportion of households lacked information on some ancillary variables, they did not explore whether missing data represented systematic (rather than solely stochastic) error.

Identifying the prevalence and nature of errors in consumer file data thus remains a critical question for understanding their potential utility. If errors in various sources of consumer file data are sufficiently frequent and systematic, the data may not be useful for targeted sampling or nonresponse adjustment.
Implications of Data Quality

Low-quality ancillary data – whether because of misinformation or missing information – present large challenges for incorporation into survey research. Consider, for example, the use of various forms of auxiliary data to improve the sampling of a group like the U.S. Hispanic population. Traditionally, Hispanic persons have been difficult to sample because they have a lower response rate than non-Hispanic persons (cf. Johnson et al. 2002; Perl, Greely, and Gray 2006; Zambrana and Carter-Pokras 2001). Hence, it might seem efficient to use available data to identify this population in advance and to increase the probability that Hispanic individuals would be sampled. This could be accomplished by targeting individuals with common Hispanic surnames (e.g. Davern et al. 2007; Hazuda et al. 1986; Word and Perkins 1996), areas with high Hispanic population density (e.g. Fiscella and Fremont 2006), or using consumer file data where Hispanic households have been “flagged” (cf. Barron et al. 2012; Li et al. 2013; Link and Burks 2013). If done properly, oversampling procedures might ensure that the Hispanic segment comprised the same proportion of respondents as in the target population (cf. Kalton 2009). But bias could enter this process if some individuals were misclassified in any source of auxiliary data, including ancillary data (cf. Swallen et al. 1997). If proper correctives are not employed, non-Hispanic persons who were misclassified as Hispanic might be overrepresented in the final dataset whereas Hispanic persons who were incorrectly classified as non-Hispanic would be underrepresented. Collectively, this could lead us to mischaracterize the nature of both the Hispanic segment of the sample and the larger population. In a similar vein, an inability to characterize some individuals because of missing ancillary data could also result in inaccurate conclusions (cf. King et al. 2001) or compel researchers to limit the share of the sample allocated to strata dependent on ancillary data. Corrective strategies can be employed to prevent these issues from introducing bias (Estevao and Sarndal 2006; West and Little 2013), but the benefits of stratifying may or may not outweigh the costs (cf. Davern et al. 2007; Santos 1991; Winship and Radbill 1994).[3] Hence, the capacity for consumer file data to improve sampling similarly depends on their accuracy and completeness.

[3] The implications of using these kinds of data for sampling efficiency will depend on the accuracy of the ancillary data, the relative sizes of strata, and the differences between individuals in targeted groups that were correctly and incorrectly stratified.

Similar inferential limitations may confront researchers hoping to use consumer file data to create post-stratification weights or address problematic sampling frames. Such correctives depend on the extent to which ancillary data can discriminate between respondents and the population as a whole (cf. Deville, Sarndal, and Sautory 1993). Although weighting adjustments do not directly depend on the accuracy of ancillary data, most corrective tools require that the processes generating any source of ancillary data are unrelated to distinctions between respondents and nonrespondents (Ibrahim, Lipsitz, and Horton 2001). To the extent that some consumer file data are inferred from other information (e.g. Greenyer 2006), this assumption is likely to be violated.[4] Missing information in consumer file data may also correlate with distinctions between respondents and nonrespondents, which could hinder correctives as well.

To uncover the presence and potential implications of these sorts of errors, researchers need to systematically evaluate consumer file data to understand the conditions under which they might be useful and where they may introduce additional complications.
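The oversampling concern raised in this section can be made concrete with a small worked sketch. It is offered only as an illustration under stated assumptions and is not part of the study's analysis: the function name, its parameters, and every numeric input are hypothetical. The sketch assumes a binary ancillary flag whose quality is summarized by a sensitivity (share of true group members who are flagged) and a specificity (share of non-members who are not flagged), and it assumes flagged households are selected at a fixed multiple of the rate for unflagged households.

```python
def targeted_sample_composition(true_share, sensitivity, specificity, oversample_ratio):
    """Expected makeup of a sample that oversamples ancillary-flagged households.

    All arguments are hypothetical illustration inputs:
      true_share       - population share of the target group (by self-report)
      sensitivity      - P(flagged | in target group)
      specificity      - P(not flagged | not in target group)
      oversample_ratio - selection rate of flagged relative to unflagged households
    """
    # Population shares of the four flag-by-truth cells.
    true_pos = true_share * sensitivity
    false_neg = true_share * (1 - sensitivity)
    false_pos = (1 - true_share) * (1 - specificity)
    true_neg = (1 - true_share) * specificity

    flagged = true_pos + false_pos
    if flagged == 0 or true_share == 0:
        raise ValueError("flagged stratum and target group must be non-empty")

    # Flagged cells are selected oversample_ratio times as often as unflagged cells.
    total = oversample_ratio * flagged + false_neg + true_neg
    return {
        # Unweighted share of true group members among sampled households.
        "target_share_in_sample": (oversample_ratio * true_pos + false_neg) / total,
        # Share of the flagged stratum that truly belongs to the target group.
        "flagged_stratum_purity": true_pos / flagged,
        # Share of true group members the flag fails to identify.
        "target_members_missed_by_flag": false_neg / true_share,
    }


# Hypothetical inputs: a 16% target group, an imperfect flag, 3-to-1 oversampling.
result = targeted_sample_composition(0.16, 0.70, 0.90, 3.0)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, lower flag purity means that more of the oversampling effort is spent on households outside the target group, while group members missed by the flag are sampled only at the base rate. Those are the two kinds of error described above.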
[4] The processes distinguishing between actual values and inferred values are likely to correlate with the distinction between respondents and nonrespondents.

The Current Study

This research represents a first foray into evaluating the quality of one well-regarded source of consumer file ancillary data for survey purposes. With the aid of a unique dataset from GfK that links an address-based sample (ABS) with ancillary demographic data from a single vendor, we conduct two analyses: Analysis 1 explores correspondence between ancillary data and self-reports by comparing survey responses in the ABS sample to consumer file values about those same households for the same variables. Analysis 2 evaluates the nature of incompleteness in these ancillary data by investigating whether missingness in the ancillary data is ignorable or nonignorable. The results allow us to test whether the ancillary data examined appear to provide an accurate picture of respondents and thus whether the data might lead to improvements in sampling and nonresponse adjustment.

Methods

Sample

Data for all analyses come from GfK Custom Research, LLC. In January of 2011, GfK used the U.S. Postal Service’s Computerized Delivery Sequence File (CDSF) to choose 25,000 random addresses that would be recruited by mail (with telephone follow-ups where numbers were available) for the purpose of having them join KnowledgePanel®, an online probability-based sample of U.S. adults. The CDSF covers over 95 percent of American households, making it one of the broadest potential sampling frames (Iannacchione 2011).[5] Because of the breadth of the sampling frame, we could expect that, with a 100% response rate, the selected households would closely mirror all American households. Addresses were chosen in four strata based on age (household contained an 18-24 year-old person vs. all others and unknown) and Hispanic status (a Hispanic person or surname is associated with the household vs. all others and unknown) as predicted by the ancillary data; weights were used to correct for this decision (see below). Of 25,000 sampled addresses, 2,498 households were successfully recruited to join the panel, a household response rate of 10.0% (AAPOR RR1). This response rate is in line with many current probability sample surveys and thus represents a typical survey circumstance for testing correspondence between data sources. Multiple individuals were allowed to sign up for the panel from each of these households. In total, 4,472 individuals were recruited, with the median household yielding two respondents.[6] Because all panel surveys for KnowledgePanel® are completed online, GfK provided a laptop computer, Internet access, or both to panel members for whom these devices/services were not already available. Self-report data came from the Core Adult Profile Survey, the first survey respondents complete upon admission to the panel. Thus, there is no gap between panel admission and unit-level survey response. By using the first available data provided by these panelists, we minimized the potential influence of attrition and panel conditioning.

[5] It excludes only locations that do not receive individually addressed mail. These include institutionalized populations (e.g. college dorms and prisons), some New York apartment buildings, and groups like the homeless.

[6] The mean number of respondents per household was 1.79. Fewer than 1% of all households yielded more than 4 respondents and none yielded more than 10.

The survey data used for the current study are intended to be illustrative of the quality of samples used broadly by academic and commercial researchers. GfK’s KnowledgePanel has been demonstrated to have error rates similar to RDD (Yeager et al. 2011); therefore, the findings in the current study would appear to be projectable to RDD surveys of comparable quality.

Consumer File Data

Consumer file data were linked to all 25,000 households in the ABS sample.[7] These data were provided and matched to addresses by Marketing Systems Group (MSG), the firm that produced the ABS sampling frame. MSG collates data from several sources to compile information on the CDSF addresses; MSG was responsible for matching consumer file data to the CDSF addresses and to one another. All addresses in the sampling frame had consumer file data for at least one variable. Ancillary data for the current study were originally sourced from InfoUSA, Experian, and Acxiom. Since consumer file data were themselves produced through a combination of aggregation and inferential techniques, it was impossible to trace the source of any particular piece of information about a particular household.

[7] Consumer file ancillary data is appended to GfK data, but not the other way around. GfK KnowledgePanel® data is provided by respondents under an agreement where it cannot be linked to outside data sources in an identifiable fashion. Data provided by respondents thus cannot directly influence the consumer file data source, avoiding one possible confound.

Despite the opaque nature of individual ancillary measures, the firms providing the information to MSG suggest that these data are ideal for tracking and identifying Americans and are used as such by some researchers or in direct mail campaigns. Acxiom (2011, 10) claims its data “covers more than 99% of marketable addresses worldwide” and incorporates regular updates from the U.S. Postal Service. Experian notes that it excels at linking identities between social media sites, phone numbers (landline and mobile), work and home addresses, email accounts, and other online identifiers (Tewksbury and Roy 2012). And InfoUSA (2013) monitors voter registration, utility, and real estate data to compile information, with monthly updates to keep records current. These three firms represent some of the largest and most well-respected sources of consumer file data. The aggregation across these sources as implemented by MSG should also reduce the number of households for which we are missing data (though this process could result in inconsistent data quality). Hence, we would expect that aggregating across these sources would provide one of the best potential databases for keeping track of the American public, presuming that the data themselves are accurate.[8]

Variables

Homeownership, household income, and household size were measured in both datasets at the household level. Full question wordings and coding for household-level variables in the ancillary data are shown in Online Appendix A.[9] Marital status, education, and age were measured in both datasets at the individual level. Question wordings and coding for these variables are shown in Online Appendix B. Table 1 presents descriptive statistics for all measures and their related missingness count among respondents (n=4,472) and for all sampled cases (n=25,000).

[INSERT TABLE 1 ABOUT HERE]

Weighting

Six sets of weights were generated to assess correspondence between self-reported and ancillary distributions across both analyses. For all analyses, we produced weights to correct for GfK’s procedure in stratifying its recruitment sample. Additional weights were then designed to adjust for differences between individual-level self-reports and household-level ancillary data.
Correctives were created that either 1) down-weighted data from households with multiple members (for which two approaches were examined) or 2) sampled only the individual in each household whose age most closely corresponded with the age in the ancillary data (which was examined with four additional methods). Because the substantive findings did not vary across the six weighting strategies, which are described in full in Online Appendix C, we present only results from those respondents in each household who most closely corresponded with the age values from the ancillary data for all analyses.

[8] Because of the nature of these data, we are not able to diagnose the source of discrepancies between ancillary and self-reported data. We also cannot conclude that other sources of ancillary data or other sources of survey data would not yield different results. Such comparisons should be a subject of continuing research.

[9] Substantive variables used at both the household level and the individual level were those for which measurement categories could be matched across self-report and ancillary measures. Three ancillary measures (presence of a telephone, race/ethnic status, and number of children in the household) were excluded because the match between these measures in the datasets would have been inconsistent.

Analysis 1: Correspondence between Ancillary Data and Self-Reports

Analytic Method

To evaluate agreement (our proxy for accuracy) between self-reports and the ancillary data examined herein, we compared the distributions of variables in the ancillary data with corresponding measures in the self-reported survey results. This evaluation proceeded in a three-step process. We first assessed the extent to which self-reported and ancillary data revealed consistent information about households (and individuals).
High agreement rates would indicate relatively accurate ancillary data whereas low agreement rates would call that accuracy into question.[10] Second, to understand whether the ancillary and self-report data differed in systematic ways, we explored whether ancillary measures tended to provide systematically larger or smaller values than corresponding self-reports. If larger or smaller values were disproportional, the results would suggest that misclassifications were a product of bias rather than random measurement error. Finally, for measures with more than two categories, we assessed the proportion of responses that differed by a large margin – defined as greater than five years for age and two or more categories for ordinal measures. These “far-off” cases were unlikely to be a product of ancillary data that was simply out of date.

[10] Although some of these discrepancies could occur as a product of inaccuracies in survey values, we think this is not a major concern for two reasons. First, there is little evidence of systematic biases in reporting demographic variables in surveys (Calahan 1968; Weaver 2000). Second, web-based survey administration appears to minimize biases (see Chang and Krosnick 2010), further mitigating this likelihood.

Results

Correspondence between ancillary and self-reported values differed markedly across variables. The reports largely agreed in cases like homeownership, marital status, and age, with 89.0%, 73.1% and 67.0% agreement, respectively (Table 2). In contrast, household income, household size, and education differed enormously between the data sources (with 22.1%, 27.8%, and 27.2% corresponding, respectively). There was no apparent pattern discriminating between variables with high and low correspondence. These results can be compared with around 90-95% of individuals who report consistent values for changeable demographic variables from year to year (Smith and Stephenson 1979).
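The three comparisons described under Analytic Method (agreement, direction of disagreement, and far-off cases) can be sketched as follows. This is a minimal, unweighted illustration rather than the authors' code. The function name `compare_sources`, the `min_gap` parameter, the integer coding of categories, and the example values are all assumptions introduced here, and the survey weights described above are deliberately omitted.

```python
def compare_sources(self_report, ancillary, min_gap=2):
    """Unweighted agreement summary for one variable measured in both sources.

    Values are assumed to be integer-coded (ordinal categories, or age in years);
    None marks a missing value. A pair counts as far-off when the two sources
    differ by at least min_gap (2 for ordinal categories, 6 for age in years,
    following the definitions in the text).
    """
    # Keep only cases observed in both sources.
    pairs = [(s, a) for s, a in zip(self_report, ancillary)
             if s is not None and a is not None]
    if not pairs:
        raise ValueError("no cases with both a self-report and an ancillary value")

    n = len(pairs)
    return {
        "n_compared": n,
        # Step 1: share of cases where the two sources give the same value.
        "agreement": sum(s == a for s, a in pairs) / n,
        # Step 2: direction of disagreement, where an imbalance suggests bias.
        "ancillary_higher": sum(a > s for s, a in pairs) / n,
        "ancillary_lower": sum(a < s for s, a in pairs) / n,
        # Step 3: large discrepancies unlikely to reflect slightly stale data.
        "far_off": sum(abs(a - s) >= min_gap for s, a in pairs) / n,
    }


# Hypothetical education categories (1 = lowest, 4 = highest) for five people.
summary = compare_sources([1, 2, 3, 4, None], [1, 3, 1, 4, 2], min_gap=2)
print(summary)
```

For age in years, calling the function with `min_gap=6` corresponds to the "greater than five years" definition of a far-off case.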
Ancillary and self-reported measures corresponded at different rates across levels of the same variables, a pattern indicative of bias. Looking only at households with both types of data, homeownership, marital status, and income were consistently higher in the ancillary data relative to self-reports. Of the 23.4% of households that self-reported that they did not own their homes, fully 28.9% were classified as homeowners in the ancillary data. In contrast, only 5.5% of self-reported homeowners were classified as non-owners in the ancillary data (see Table D1 in Online Appendix D). Perhaps most troublingly, of the 38.8% of individuals who reported they were unmarried in the self-reports, more than half (51.8%) were classified as married in the ancillary data. Yet among individuals reporting that they were married, 89.9% were identically classified in the ancillary data.

Self-report and ancillary values did not always diverge in similar patterns. Individuals tended to self-report lower incomes and educational achievement than was apparent in the ancillary data whereas discrepancies in reports of household size and age did not skew as strongly in a single direction. Overall, there did not seem to be a clear pattern for when the values from the two data sources differed in these manners (Table 2; Online Appendix D).

[INSERT TABLE 2 ABOUT HERE]

Large discrepancies between data sources emerged frequently for most variables (see Table 2, % Far-off). Some 43.1% of households reported income that differed from the category suggested in the ancillary data by more than $10,000 per year. The number of occupants reported by a household differed by two or more individuals from the ancillary value in 35.1% of cases. One in four individuals, 24.7%, reported an education level that differed by two or more categories from the ancillary value.
And 19.9% of self-reported ages differed from the ancillary value by six or more years, even though respondents were selected for the closest age match. Such large discrepancies seem unlikely to have emerged from slightly outdated consumer file data.

We also computed correlations between consumer file and self-reported measures of each variable to test whether data for continuous measures may have differed in some systematic way between the two sources. This could happen if ages were consistently out-of-date or if incomes were overstated in the ancillary data. The correlations (r) varied considerably, but tended to be moderate in strength (a single low of r=.19 for household size, and between .39 and .73 for all other variables). Hence, it seems unlikely that a single source of systematic error was responsible for the discrepancies observed.

Discussion

Correspondence between the ancillary data and self-reports varied across the six variables examined. Disparities between self-report and ancillary results were fairly large for income, household size, and education; the data streams were generally more consistent when assessing homeownership, though notable biases emerged. At a minimum, the findings indicate that the ancillary data used may not be particularly accurate in their description of individual or population parameters. However, ancillary information was considerably better than chance determinations for all variables.

In considering uses of ancillary data, current results present a mixed picture. For example, the data could either help or hinder researchers who wish to use the information to better target demographic groups that are traditionally underrepresented in surveys.[11] Typically this process involves an attempt to sample individuals from such groups at disproportionately high rates.
Yet identifying targeted individuals has long proven a challenge, because we ordinarily do not know anything about households or individuals before they are sampled. Prior studies have stratified samples using tools such as lists of ethnic surnames (e.g. Davern et al. 2007; Fiscella and Fremont 2006) or heavily-minority Census tracts (ANES 2013; Kalton 2009) to increase the proportion of respondents in these groups. Such strategies can increase the error in a survey estimate even when proper weighting is applied (due to increases in variance). Hispanic persons with traditionally Hispanic surnames might have different experiences from those with names that are more difficult to classify. Similarly, African-American individuals who live in predominantly African-American neighborhoods may have very different experiences from those who live in predominantly White neighborhoods. Hence, this kind of targeted sampling procedure can introduce bias unless proper weights are applied; researchers ignoring this bias when making inferences could reach inaccurate conclusions concerning both targeted groups and society as a whole.

[11] The term “rare population” is often used to discuss targeted sampling of traditionally underrepresented subpopulations; we avoid that term here in favor of targeted groups because it is also feasible to alter the sampling ratio for large groups within the population (even a majority) based on the use of auxiliary data.

Counteracting the potential for bias when using targeted sampling can be complicated. It may undermine the efficiency gained by using the ancillary (or other auxiliary) data in the sampling process. Two sets of weights must be applied to prevent bias (Estevao and Sarndal 2006). First, one set must equalize the probability of selection between individuals in the ancillary-defined target groups that were sampled at disproportionately high rates and “all others” in the remaining ancillary-defined groups. This step ensures that the easy-to-classify individuals do not end up defining the category, but also eliminates much of the benefit from the disproportionate sampling. Second, weights can be used to increase representation of the target group (as defined by self-reports) to bring that group back to its population proportion. When agreement between ancillary and self-reported values is low, this can actually increase the variance in the estimates and thus the expected error because misidentified individuals in the target population will have even higher weights than they would have without targeted sampling (cf. Santos 1991; Winship and Radbill 1994). The overall precision of a targeted sampling strategy of this sort could either go up or down depending on agreement between the ancillary and self-reported measures as well as the size of the sampling strata.

Analysis 2: Missingness in Ancillary Data

Analytic Method

Three tests were used to assess the scope and nature of missingness in the ancillary data. First, we examined the extent of missingness in the ancillary data for each variable and across cases. Second, we compared missingness in ancillary data variables to self-reports for those same measures. Differential missingness across self-report categories would provide strong evidence of nonignorable missingness. Finally, we used logistic regressions to predict the presence of missingness for each of the ancillary variables based on the values of self-report measures and an OLS regression to predict the number of ancillary variables for which data were missing across cases.
Presumably, if ancillary data were Missing Completely at Random (MCAR), missing data should not be concentrated among specific cases, should be unrelated to self-reports of those same variables, and should be impossible to predict precisely with the logistic regressions. If ancillary data were Missing At Random (MAR), in a way that could be predicted using observable variables, we should see a strong ability for the logistic regressions to predict missingness. Strong relations between self-reports and ancillary missingness on the same variables, coupled with a relative inability to predict missingness in regressions, would indicate that missingness was almost certainly nonignorable.

Missingness

Missingness indicator variables were created to identify cases missing ancillary data for each of the variables of interest (0 = presence of data, 1 = absence of data). A total missingness variable was defined as the sum of the six missingness indicators for each case (ranging from 0 to 6).

Descriptive Statistics

Ancillary data were missing for a large number of cases. On average, 16.5% of households were missing data for any given ancillary data variable. Missingness varied considerably across variables, from only 3.6% of households missing household size to 28.5% of households missing age information (Figure 1, histogram a).

[INSERT FIGURE 1 ABOUT HERE]

We also explored how missing ancillary data varied across respondents. Although the modal household was not missing data for any of the ancillary variables (44.7%; Figure 1, histogram b), missingness was not concentrated in only a few households; only 10.3% were missing data for more than two variables (Figure 1, b). Of households missing data, the vast majority were missing information for only a single variable.
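The indicator coding and per-case total defined in the Missingness subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names in `VARIABLES` and the three example households are hypothetical and are not drawn from the study data, and `None` stands in for a missing ancillary value.

```python
from collections import Counter

# Hypothetical names for the six ancillary demographic variables.
VARIABLES = ["homeownership", "income", "household_size",
             "marital_status", "education", "age"]


def missingness_indicators(record):
    """Return {variable: 0 or 1}, where 1 marks a missing ancillary value."""
    return {var: int(record.get(var) is None) for var in VARIABLES}


def total_missingness(record):
    """Sum of the six indicators for one case (ranges from 0 to 6)."""
    return sum(missingness_indicators(record).values())


def missing_rate_by_variable(records):
    """Share of cases missing each ancillary variable."""
    n = len(records)
    return {var: sum(missingness_indicators(r)[var] for r in records) / n
            for var in VARIABLES}


# Made-up households, for illustration only.
households = [
    {"homeownership": "owner", "income": "50k-75k", "household_size": 2,
     "marital_status": "married", "education": "some college", "age": 52},
    {"homeownership": None, "income": "25k-35k", "household_size": 1,
     "marital_status": None, "education": "high school", "age": None},
    {"homeownership": "owner", "income": None, "household_size": 4,
     "marital_status": "married", "education": None, "age": 41},
]

totals = [total_missingness(h) for h in households]
rates = missing_rate_by_variable(households)
# Number of cases missing 0, 1, 2, ... variables (the kind of tally a histogram would show).
distribution = Counter(totals)
```

Averaging the values in `rates` gives the mean share of cases missing any given variable, while `distribution` summarizes how missingness is spread across cases.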
To examine whether cases lacking data on one variable were also likely to have missing data on other variables, we conducted a reliability test among the six missingness indicator variables. Cronbach’s alpha was .59, indicating moderate consistency among missingness indicators, but not enough to consider them a single factor.

Ancillary Missingness by Self-Report Value

Missing ancillary data appeared distinctly nonrandom when compared with self-reported household status on the same variables. Ancillary homeownership was missing for 4.9% of self-reported home-owning households and 33.2% of non-owning households (Figure 2, a; χ²(1) = 273.8, p < .001). Ancillary income was missing systematically for households with lower reported incomes. Ancillary income data were missing for 13.5% of households reporting an income below $35,000 per year but for only 3.3% of households with incomes above $150,000 (Figure 2, b; χ²(7) = 56.7, p < .001). Smaller households were missing more information about household size than were larger households (Figure 2, c; χ²(4) = 11.7, p < .05). These results refuted the possibility that these ancillary data were MCAR and indicated that missing ancillary household data were likely to be nonignorable.

[INSERT FIGURE 2 ABOUT HERE]

Rates of missingness in individual-level ancillary data also frequently depended on self-reports for the same variables. When respondents reported that they were married, only 13.3% of ancillary marital status data were missing. In contrast, 36.3% of ancillary marital information was missing among unmarried individuals (Figure 2, d; χ²(1) = 118.4, p < .001). Variation in missingness across self-reported education levels was not statistically significant (Figure 2, e; χ²(4) = 3.9, p = .84). Finally, missing ancillary age information was more common among younger individuals.
For individuals aged 18-24, 62.6% of ancillary age data were missing; only 12.7% of ancillary age data were missing for 55-64 year olds (Figure 2, f; χ²(6) = 292.2, p < .001). As with household-level variables, missing data for individual-level variables appeared systematic.

Regressions

To understand patterns of missingness, regressions predicted the presence of missingness for each of the ancillary data measures and the total number of missing ancillary variables as a function of each respondent’s status on self-reported demographics. To conduct these regressions, multiple imputations were used to account for missingness in self-reported values among all respondents.[12] All regression results were weighted as described above.

Predictors of missing ancillary data varied depending on the missingness indicator being predicted. Three self-reported demographics predicted multiple missingness indicators: homeownership, household size, and age. Compared to non-owners, self-reported owners were less likely to lack ancillary data for homeownership, household income, marital status, and age (Table 3, row 1). The self-reported number of persons in the household predicted missing ancillary income, marital status, and age, with larger households translating into a reduced likelihood of missingness. Age predicted nonlinearly; middle-aged Americans were the least likely group to be missing information for marital status or age.

Self-reported marital status and education each predicted one of the missingness indicators. Married individuals were, ceteris paribus, less likely than unmarried individuals to be missing information on marital status. Individuals who reported that they had less than a high school education (the omitted category) were the most likely to be missing ancillary age information.
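The logistic regressions described above model a 0/1 missingness indicator. As a rough sketch of how the fit of such a model can be summarized, the Python below computes McFadden's pseudo R2 for the simplest possible case, a single categorical predictor. In that case the maximum-likelihood fitted probabilities equal each group's observed missingness rate, so no optimizer is needed. The function names and the counts are hypothetical; this is not the multi-predictor, weighted, imputed specification used in the analysis.

```python
import math


def bernoulli_loglik(n_missing, n_total, p):
    """Log likelihood of n_missing ones among n_total cases at probability p."""
    n_present = n_total - n_missing
    ll = 0.0
    if n_missing:
        ll += n_missing * math.log(p)
    if n_present:
        ll += n_present * math.log(1.0 - p)
    return ll


def mcfadden_r2(groups):
    """groups: list of (n_missing, n_total), one tuple per predictor category.

    The fitted model reproduces each group's observed missingness rate;
    the null (intercept-only) model reproduces the pooled rate.
    """
    pooled_missing = sum(m for m, _ in groups)
    pooled_total = sum(n for _, n in groups)
    p_null = pooled_missing / pooled_total
    ll_null = sum(bernoulli_loglik(m, n, p_null) for m, n in groups)
    ll_fitted = sum(bernoulli_loglik(m, n, m / n) for m, n in groups)
    return 1.0 - ll_fitted / ll_null


# Hypothetical counts: 13% missing in one group, 36% missing in the other.
r2_different_rates = mcfadden_r2([(130, 1000), (360, 1000)])
# Identical missingness rates: the predictor adds nothing over the null model.
r2_same_rates = mcfadden_r2([(100, 1000), (100, 1000)])
```

With these made-up counts the pseudo R2 stays below .10 even though one group is missing data almost three times as often as the other. A predictor can therefore be clearly related to missingness while still leaving most individual cases unexplained.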
[12] Multiple imputations were conducted by using Multiple Imputations via Chained Equations, predicting each missing value with all self-reports and ancillary demographic variables (Buuren and Groothuis-Oudshoorn 2011). Running these same regressions without imputations led to the same conclusions. Imputed versions were used to avoid the possibility that missingness in self-reports would bias the results. Hence, values for all self-reported variables were imputed. Imputations were conducted to mirror the full set of respondents (not all sampled households). Imputations were only used for regression models.

[INSERT TABLE 3 ABOUT HERE]

Despite significant predictors for most missingness indicators, missing ancillary data were not well predicted in the current analyses. The McFadden’s pseudo R2 for missingness indicators was always below .20, suggesting that we could not effectively account for when data were missing,[13] and the full list of covariates improved the percent of cases correctly predicted over a null model for only one of the six indicators: missing ancillary age information (Table 3). Although the lack of strong prediction does not indicate that these covariates were irrelevant, it does imply that we have an incomplete understanding of the circumstances under which ancillary data were missing.

Using an OLS model, predictions of the number of ancillary measures missing revealed similar challenges. Missingness remained poorly explained even though evidence suggested that overall missingness was related to less self-reported homeownership, larger household sizes, unmarried status, and relative youth (Table 3, column 7).
These variables again captured only a small portion of the variance across individuals (R2 = .12), though the model may be limited due to the small number of individuals missing multiple ancillary measures.[14]

[13] McFadden’s pseudo R2 is an estimate of the proportion of variance accounted for in the tested model as compared to a null model with only an intercept. The statistic represents the proportion of the total log likelihood of the null model that is explained by the fitted model. It is calculated as 1 - ln(L_fitted)/ln(L_null). For most purposes, it can be interpreted similarly to an R2 statistic and reveals the approximate proportion of the residual variance in a null model that is explained by the inclusion of all predictors.

[14] Prediction may be slightly better than reported, however, given that the data are highly skewed, with most cases at or near zero, and do not meet assumptions of normality implicit in OLS regression. Notably, however, no additional variance was explained by treating the missingness count as a negative binomial, indicating that such gains are likely to be minimal.

Discussion

Results of analysis 2 suggested that the ancillary data examined were missing in ways that could be problematic for survey research; specifically, they appeared to represent a nonignorable source of bias. Missingness was more common for some measures than for others and varied across categories of self-reports for the same variables. This presents a series of problems for researchers hoping to use ancillary data for sampling or to correct for known survey errors, as relevant data may be missing.

Patterns of missingness in the ancillary data were difficult to predict with simple covariates or regression techniques. Because missing data appeared to violate both Missing Completely At Random (MCAR) and Missing At Random (MAR) assumptions, the use of these ancillary data for analytic techniques could result in substantive error.
For example, only 37% of 18-24 year old householders could be identified using the ancillary age data. Conclusions using only this group might not mirror the 63% of 18-24 year olds whose households lacked ancillary age data. Researchers hoping to use ancillary data such as those presented here for either oversampling or corrections might therefore be well advised to conduct an analysis of how sensitive their use of the data would be to violations of assumptions about the data’s accuracy and completeness.

The most pernicious of our results indicated that missingness in these ancillary data was often related to self-reported values of the same variables. This is a major problem because it means that errors from the missing ancillary data may only be apparent once self-reports have been collected. Of course, this undermines one of the biggest advantages in the use of these consumer file data, namely that they could be used prior to the sampling process.

Because ancillary missingness correlated with self-reports, sampling strategies based on this set of ancillary data would likely result in sampled units that are differentially accurate across different variables (e.g. we do a much better job identifying homeowners across variables than we do at identifying home renters). Instead of reducing variance in the weights required for such a sample, oversampling with the use of ancillary variables necessitates a two-stage weighting process (i.e. oversampled young people need to be down-weighted before actual young people can be adjusted to match their population proportions).[15] Depending on the specific variables and sampling ratios in play, such corrections might sometimes increase the variance and design effect of sample weights instead of reducing them (cf. Santos 1991; Winship and Radbill 1994).

[15] Readers should note that this corrective will only work if all oversampled individuals are downweighted (regardless of whether they are actually in the target category or not) to correct for the stratification procedure and if self-reports are used to weight the targeted group to match the population. Importantly, the method for collecting population-level benchmark data on the targeted group must also match the method used for generating this information about individuals (or households) in the survey for the final poststratification to yield an unbiased set of weights.

General Discussion

Use of consumer file data for sample targeting and as a survey corrective is growing (Smith 2011). We can reasonably expect the consumer file products to improve rapidly as a result of the commercial sector’s investment in the data sciences. There is a productive role for public opinion researchers to collaborate with the commercial sector in conducting specific tests aimed at improving the accuracy and completeness of the consumer file data, and creating new data products tailored to the sampling and weighting needs of the survey research field. This is the first study to explore how one vendor source of these data derived from multiple commercial sources compares with more traditional self-report measures.

The current analyses represent a first foray into understanding the nature of potential biases when using ancillary marketing data to supplement (or supplant) the traditional survey process. In two analyses we assessed accuracy and completeness in one source of consumer file marketing data. The analyses presented provide some hopeful, but many discomforting, signs for researchers hoping to use at least the current source of ancillary consumer file data to bolster survey research. We thus conclude that survey researchers should carefully consider the potential implications of systematic bias and missingness in consumer file marketing data before incorporating datasets into sampling and weighting procedures.
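One rough way to gauge those implications for targeted sampling is to ask how much effective sample size the target group gains once the oversampling has been weighted back out. The Python sketch below is an illustration rather than the procedure used in this study. It relies on Kish's approximation for effective sample size, (sum of weights)^2 / (sum of squared weights), and works with expected cell shares instead of random draws. Every parameter name and value is a hypothetical assumption. Missing ancillary values are not modeled, and the second-stage adjustment to self-reported population shares is omitted because it leaves these expected weights unchanged.

```python
def effective_target_ratio(target_share, sensitivity, false_positive_rate,
                           oversample_ratio):
    """Effective sample size for the target group under flag-based oversampling,
    relative to an equal-probability sample of the same total size.

    target_share: population share of the target group (by self-report)
    sensitivity: P(ancillary flag | in target group)
    false_positive_rate: P(ancillary flag | not in target group)
    oversample_ratio: sampling rate for flagged units relative to unflagged
    """
    k = oversample_ratio
    # Expected (unnormalized) sample mass in each cell.
    target_flagged = target_share * sensitivity * k
    target_unflagged = target_share * (1 - sensitivity)
    other_flagged = (1 - target_share) * false_positive_rate * k
    other_unflagged = (1 - target_share) * (1 - false_positive_rate)
    total = target_flagged + target_unflagged + other_flagged + other_unflagged

    # First-stage weights undo the oversampling: flagged units get 1/k.
    w_flagged, w_unflagged = 1.0 / k, 1.0

    # Kish effective size among target-group members, per sampled unit.
    sum_w = target_flagged * w_flagged + target_unflagged * w_unflagged
    sum_w2 = (target_flagged * w_flagged ** 2
              + target_unflagged * w_unflagged ** 2)
    n_eff_per_unit = (sum_w ** 2 / sum_w2) / total

    # An equal-probability sample yields target_share target cases per unit.
    return n_eff_per_unit / target_share


# Hypothetical scenarios: a 20% target group, flagged units sampled at 3x.
accurate_flag = effective_target_ratio(0.2, 0.9, 0.05, 3)
inaccurate_flag = effective_target_ratio(0.2, 0.5, 0.3, 3)
no_oversampling = effective_target_ratio(0.2, 0.9, 0.05, 1)
```

Under these assumed values, the accurate flag yields a ratio above 1 (oversampling helps), while the inaccurate flag yields a ratio below 1, meaning the target group ends up with less effective information than it would have had without targeting.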
The consumer file data we examined were not consistently inaccurate. For some measures, agreement rates between consumer file and self-reported information were very high. For other measures, agreement rates were little better than chance. This pattern might emerge if some variables or sources effectively reflect the population even while others do not. Continued assessments will be needed to determine if certain classes of variables or sources of data provide consistent agreement. Variables and sources where ancillary data correspond with self-reports should provide the best chance for improving both survey administration and correction (West and Little 2013).

Similar variability was observed in patterns of missing ancillary data in the current analyses. Information about some variables, such as household size, was far more complete than information about others, such as marital status. Further, patterns of missingness appeared stochastic for some measures, such as education, whereas missingness in other measures, such as age, appeared to be highly systematic. This too presents a mixed picture. We should be wary of the large amount of missing ancillary data in thinking about survey correctives, but evidence of variability could imply that some ancillary measures may not pose systematic problems. Identifying reliable measures would be of considerable value and should be the subject of further investigation.

Sources of Inconsistency

Inconsistencies between consumer file data and self-reports that emerged in this study are difficult to diagnose. They could have appeared for a variety of reasons. Perhaps these ancillary data were out of date, perhaps they were products of inference on the part of data aggregators, perhaps the tools used to link ancillary data with addresses were flawed, or perhaps this was a function of the aggregation procedures used at the single firm examined.
Generally consistent results with increasingly strict weighting strategies (see Online Appendix F) suggest that mismatches between individuals within a household were unlikely to account for all discrepancies observed. It is also possible that the ancillary measures are capturing something fundamentally different from survey responses (though what, exactly, would be unclear) or that survey misreporting may account for some of the discrepancies. The roots of missingness are similarly opaque. It is clear at a minimum that researchers cannot blindly trust that both survey and ancillary data are highly accurate when they give such discrepant results. Whether other sources of consumer file data will provide results that consistently match survey responses remains an important topic for future research.

Of course, we wish that we could open the black box and critically evaluate the procedures used for each step in the process of generating the consumer file data. Because these data are of considerable value to the private companies that aggregate them, however, social scientists seem unlikely to gain a full picture.

Meanwhile, there remains considerable reward in evaluating the ways that even these flawed data may yet aid survey research. The fact that an outside firm was able to match one source of consumer file data to the entire sample and the generally decent correspondence between the ancillary and self-reported data examined suggest that such data could prove useful for some survey purposes. Specifically, even data that are only somewhat accurate may be able to help researchers conduct targeted sampling for underrepresented populations and adjust for survey nonresponse. We discuss some of these possibilities in Online Appendix E.

Limitations and Future Research

This study presents results that pertain to a handful of demographic variables from a single source of consumer file data.
They indicate that these particular data may prove problematic for a variety of research purposes. But there is much that remains unknown. We do not know whether similar inaccuracies and omissions might complicate the use of other sources of consumer file data. Correspondences might differ if data were matched to addresses in a different way, derived from a different set of sources, or collected at different points in time. Differences might also be observed if ancillary data were compared with survey data derived using different sampling strategies (e.g., telephone sampling) or using different recruitment tools (e.g., without transitioning respondents to a web panel). Further, data on alternate types of variables may vary in their correspondence with survey data and completeness.

We also cannot conclude that ancillary data are inappropriate for any particular use without further examination. For most purposes, use of any type of auxiliary data, including the ancillary consumer file data examined here, represents a tradeoff between the information that can be gained through the use of a particular data source and the errors that emerge in the data collection process. The value of the data for mitigating error depends on how those factors relate to one another, not on the absolute accuracy or completeness of the sources.

Conclusions

In theory, consumer file marketing data would appear to offer a valuable resource for improving survey design and implementation. In practice, inconsistencies between one source of consumer file data and self-reports, coupled with patterns of systematic missingness, lead to questions over how well we will be able to leverage these possibilities. The correspondence and bias observed when comparing self-reported demographic data with the consumer file data presented here suggest that researchers should proceed with caution.
Awareness of the accuracy and completeness of any source of ancillary information is an important prerequisite to its use. Hence, instead of blindly assuming that these data will present an accurate portrait of the American public, researchers should instead consider the potential improvements that the data could offer as a set of open empirical questions. These queries, more than the ease and ability of using consumer file data, should guide practitioners in their decision-making. ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 28 References Acxiom. 2011. “Reaching More Consumers with Certainty.” Acxiom White Paper. ANES. 2013. “User’s Guide and Codebook for the Preliminary Release of the ANES 2012 Time Series Study.” Electionstudies.org. Ann Arbor, MI and Palo Alto, CA: University of Michigan and Stanford University. Antoni, Manfred. 2011. “Linking Survey Data with Administrative Employment Data: the Case of the IAB-ALWA Survey.” http://doku.iab.de/fdz/events/2011/Antoni_presentation.pdf. Barron, Martin, Michael Davern, Robert Montgomery, Xian Tao, Kirk Wolter, Wei Zeng, Christina Dorell, and Carla Black. 2012. “Can Information From Market Research Companies Be Used to Develop an Efficient Sampling Strategy for a Rare Population?” New Orleans, LA: Hard to Reach Conference. Berent, Matthew, Arthur Lupia, and Jon A. Krosnick. 2011. “The Quality of Government Records and Over-Estimation of Registration and Turnout in Surveys: Lessons From the 2008 ANES Panel Study’s Registration and Turnout Validation Exercises.” nes012554. American National Election Studies Working Papers. http://www.electionstudies.org/resources/papers/nes012554.pdf. Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “MICE: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3). Calahan, Don. 1968. “Correlates of Respondent Accuracy in the Denver Validity Survey.” Public Opinion Quarterly 32 (4): 607–21. Calderwood, Lisa, and Carli Lessof. 2009. 
“Enhancing Longitudinal Surveys by Linking to Administrative Data.” In Methodology of Longitudinal Surveys, edited by Peter Lynn, 55–72. Chichester, UK: Wiley. doi:10.1002/9780470743874.ch4. Chang, LinChiat, and Jon A. Krosnick. 2010. “Comparing Oral Interviewing with Self- ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 29 Administered Computerized Questionnaires: an Experiment.” Public Opinion Quarterly 74 (1): 154–67. doi:10.1093/poq/nfp090. Couper, Mick P, and Lars Lyberg. 2005. “The Use of Paradata in Survey Research.” Proceedings of the 55th Annual Meeting of the International Statistical Institute. Sydney, Australia. Davern, Michael, Donna McAlpine, Jeanette Ziegenfuss, and Timothy J. Beebe. 2007. “Are Surname Telephone Oversamples an Efficient Way to Better Understand the Health and Healthcare of Minority Group Members?” Medical Care 45 (11): 1098–1104. Davern, Michael, Kathleen Thiede Call, Jeanette Ziegenfuss, Gestur Davidson, Timothy J. Beebe, and Lynn Blewett. 2008. “Validating Health Insurance Coverage Survey Estimates: A Comparison of Self-Reported Coverage and Administrative Data Records.” Public Opinion Quarterly 72 (2). AAPOR: 241–59. Deville, Jean-Claude, Carl-Erik Sarndal, and Olivier Sautory. 1993. “Generalized Raking Procedures in Survey Sampling.” Journal of the American Statistical Association 88 (423). American Statistical Association: 1013–20. DiSogra, Charles, J. Michael Dennis, and Mansour Fahimi. 2010. “On the Quality of Ancillary Data Available for Address-Based Sampling.” Proceedings of the Survey Research Methods Section of the American Statistical Association. 4174–83. Estevao, Victor M, and Carl-Erik Sarndal. 2006. “Survey Estimates by Calibration on Complex Auxiliary Information.” International Statistical Review / Revue Internationale De Statistique 74 (2). International Statistical Institute (ISI): 127–47. Experian. 2013. “Data Quality.” Experian. Accessed May 28. http://www.experian.com/dataselect/ds-data-quality.html. 
Fiscella, Kevin, and Allen M. Fremont. 2006. “Use of Geocoding and Surname Analysis to ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 30 Estimate Race and Ethnicity.” Health Services Research 41 (4p1): 1482–1500. doi:10.1111/j.1475-6773.2006.00551.x. Fowles, Jinnet B., Elizabeth J. Fowler, and Cheryl Craft. 1998. “Validation of Claims Diagnoses and Self-Reported Conditions Compared with Medical Records for Selected Chronic Diseases.” The Journal of Ambulatory Care Management 21 (1). 24–34. Greenyer, Andrew. 2006. “Back From the Grave: the Return of Modelled Consumer Information.” International Journal of Retail & Distribution Management 34 (3): 212–18. doi:10.1108/09590550610654375. Hazuda, Helen P., Paul J. Comeaux, Michael P. Stern, Steven M. Haffner, Clayton W. Eifler, and Marc Rosenthal. 1986. “A Comparison of Three Indicators for Identifying Mexican Americans in Epidemiologic Research.” American Journal of Epidemiology 123 (1): 96–112. Hebert, Paul L., Linda S. Geiss, Edward F. Tierney, Michael M. Engelgau, Barbara P. Yawn, and A. Marshall McBean. 1999. “Identifying Persons with Diabetes Using Medicare Claims Data.” American Journal of Medical Quality 14 (6): 270–77. doi:10.1177/106286069901400607. Iannacchione, Vincent G. 2011. “The Changing Role of Address-Based Sampling in Survey Research.” Public Opinion Quarterly 75 (3): 556–75. doi:10.1093/poq/nfr017. Ibrahim, Joseph G., Stuart R. Lipsitz, and Nick Horton. 2001. “Using Auxiliary Data for Parameter Estimation with Non-Ignorably Missing Outcomes.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 50 (3): 361–73. infoUSA. 2013. “Data Quality.” Infousa.com. Accessed May 28. http://www.infousa.com/dataquality/. Johnson, Timothy P., Diane O'Rourke, Jane Burris, and Linda Owens. 2002. “Culture and Survey Nonresponse.” In Survey Nonresponse, edited by Robert M. Groves, Don A. Dillman, ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 31 J. L. Eltinge, and Roderick J. A. Little, 55–69. 
New York: Wiley. Kalton, Graham. 2009. “Methods for Oversampling Rare Subpopulations in Social Surveys.” Survey Methodology 35 (2): 125–41. Kessler, Ronald C., and Roderick J. A. Little. 1995. “Advances in Strategies for Minimizing and Adjusting for Survey Nonresponse.” Epidemiologic Reviews 17 (1): 192–204. King, Gary, James Honaker, Anne Joseph, and Kenneth Scheve. 2001. “Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation.” The American Political Science Review 95 (1): 49–69. Kreuter, Frauke. 2013. “Facing the Nonresponse Challenge.” The Annals of the American Academy of Political and Social Science 645 (1): 23–35. doi:10.1177/0002716212456815. Li, Ying, Whitney Murphy, Gillian Lawrence, Jennifer Vanicek, Kari Carris, and Felicia LeClere. 2013. “Hola or Hello? A Priori Assignment of Interview Language Using Demographic Flags.” Annual Conference of the American Association for Public Opinion Research. Boston, MA. Link, Michael W., and Anh Thu Burks. 2013. “Leveraging Auxiliary Data, Differential Incentives, and Survey Mode to Target Hard-to-Reach Groups in an Address-Based Sample Design.” Public Opinion Quarterly 77 (3): 696–713. doi:10.1093/poq/nft018. Perl, Paul, Jennifer Z. Greely, and Mark M. Gray. 2006. “What Proportion of Adult Hispanics Are Catholic? A Review of Survey Data and Methodology.” Journal for the Scientific Study of Religion 45 (3): 419–36. Sakshaug, Joseph W., and Frauke Kreuter. 2012. “Assessing the Magnitude of Non-Consent Biases in Linked Survey and Administrative Data.” Survey Research Methods 6 (2): 113–22. Santos, Robert L. 1991. “One Approach to Oversampling Blacks and Hispanics: the National Alcohol Survey.” Available from: ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 32 https://www.amstat.org/sections/SRMS/Proceedings/papers/1985_031.pdf. Sinibaldi, Jennifer, Gabrielle B. Durrant, and Frauke Kreuter. 2013. 
“Evaluating the Measurement Error of Interviewer Observed Paradata.” Public Opinion Quarterly 77 (S1): 173–93. doi:10.1093/poq/nfs062. Smith, Tom W. 2011. “The Report of the International Workshop on Using Multi-Level Data From Sample Frames, Auxiliary Databases, Paradata and Related Sources to Detect and Adjust for Nonresponse Bias in Surveys.” International Journal of Public Opinion Research 23 (3): 389–402. doi:10.1093/ijpor/edr035. Smith, Tom W., and C. Bruce Stephenson. 1979. “An Analysis of Test/Retest Experiments on the 1972, 1973, 1974, and 1978 General Social Surveys.” 8. Publicdata.Norc.org. Chicago: GSS Methodological Report. Srinath, K. P., Michael P. Battaglia, and Meena Khare. 2004. “A Dual Frame Sampling Design for an RDD Survey That Screens for a Rare Population.” Proceedings of the Survey Research Methods Section of the American Statistical Association, 4424-29. Swallen, Karen C., Dee W. West, Susan L. Stewart, Sally L. Glaser, and Pamela L. Horn-Ross. 1997. “Predictors of Misclassification of Hispanic Ethnicity in a Population-Based Cancer Registry.” Annals of Epidemiology 7 (3): 200–206. doi:10.1016/S1047-2797(96)00154-8. Tewksbury, Marcus, and Andy Roy. 2012. “The Experian Marketing Innovation Report 2012.” Experian. http://www.experian.com/assets/marketing-services/reports/ems_2012_marketinginnovation_report.pdf. Weaver, David A. 2000. “The Accuracy of Survey-Reported Marital Status: Evidence From Survey Records Matched to Social Security Records.” Demography 37 (3): 395–99. doi:10.2307/2648050. West, Brady T. 2013. “An Examination of the Quality and Utility of Interviewer Observations in ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 33 the National Survey of Family Growth.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 176 (1): 211–25. doi:10.1111/j.1467-985X.2012.01038.x. West, Brady T., and Roderick J. A. Little. 2013. 
“Non-Response Adjustment of Survey Estimates Based on Auxiliary Variables Subject to Error.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 62 (2): 213–31.
Winkler, William E. 2006. “Overview of Record Linkage and Current Research Directions.” Statistics #2006-2. Statistical Research Division, U.S. Census Bureau, Research Report Series.
Winship, Christopher, and Larry Radbill. 1994. “Sampling Weights and Regression Analysis.” Sociological Methods & Research 23 (2): 230–57. doi:10.1177/0049124194023002004.
Word, David L., and R. Colby Perkins Jr. 1996. “Building a Spanish Surname List for the 1990's: a New Approach to an Old Problem.” Technical Working Paper No. 13. U.S. Bureau of the Census. Washington, D.C.: U.S. Bureau of the Census.
Yancey, William E. 2010. “Expected Number of Random Duplications Within or Between Lists.” Proceedings of the Section on Survey Research Methods, American Statistical Association. 2938–46.
Yeager, David S., Jon A. Krosnick, LinChiat Chang, Harold S. Javitz, Matthew S. Levendusky, Alberto Simpser, and Rui Wang. 2011. “Comparing the Accuracy of RDD Telephone Surveys and Internet Surveys Conducted with Probability and Non-Probability Samples.” Public Opinion Quarterly 75 (4): 709–47. doi:10.1093/poq/nfr020.
Zambrana, Ruth E., and Olivia Carter-Pokras. 2001. “Health Data Issues for Hispanics: Implications for Public Health Research.” Journal of Health Care for the Poor and Underserved 12 (1): 20–34. doi:10.1353/hpu.2010.0547.

Table 1.
Unweighted Descriptive Characteristics of Six Demographic Self-Report and Ancillary Variables and the Number of Cases Missing Data for Each Variable

                          Self-Report Data      Ancillary Data        Ancillary Data
                                                (Respondents only)    (All sampled cases)
                          Percent    Missing    Percent    Missing    Percent    Missing
Homeownership                        1103                  594                   4456
  Owner                   69.9%                 79.4%                 76.9%
  Non-Owner               30.1%                 20.6%                 23.1%
Household Income                     1531                  326                   2591
  < 15k                   12.7%                 3.8%                  4.9%
  15k – 25k               8.6%                  6.6%                  8.3%
  25k – 35k               9.1%                  9.3%                  11.2%
  35k – 50k               16.8%                 15.6%                 16.8%
  50k – 75k               21.3%                 23.5%                 23.5%
  75k – 100k              13.3%                 17.4%                 14.9%
  100k – 150k             12.9%                 16.7%                 13.5%
  > 150k                  5.2%                  7.1%                  6.9%
Household Size                       8                     167                   2068
  1 person                14.9%                 24.2%                 33.9%
  2 persons               32.3%                 26.9%                 27.3%
  3 persons               20.7%                 19.8%                 16.6%
  4 persons               19.4%                 13.3%                 10.2%
  5+ persons              12.7%                 15.7%                 12.0%
Marital Status                       1811                  986                   6903
  Married                 55.6%                 77.1%                 71.1%
  Not Married             44.4%                 22.9%                 28.9%
Education                            1811                  910                   5392
  Less than HS            30.2%                 16.7%                 24.6%
  High School             27.6%                 25.9%                 23.0%
  Some College            30.6%                 29.5%                 28.7%
  Bachelors degree        8.8%                  17.6%                 14.9%
  Post Grad/Professional  2.8%                  10.2%                 8.8%
Age                                  8                     1272                  8660
  Mean (sd), years        43.3 (17.4)           51.0 (13.7)           50.5 (16.2)
Total N                   4472                  4472                  25000

Note: "Missing" columns give the count of cases missing data for each variable.

Table 2. Variable Value Comparisons between Survey and Ancillary Data (% Respondents)

                    % Survey      % Survey      % Survey                % Far-off
                    < Ancillary   = Ancillary   > Ancillary   Total     cases       N       Corr. (r)
Homeownership       6.8%          89.0%         4.2%          100.0%    --          1620    .68
Household Income    51.1%         22.1%         26.8%         100.0%    43.1%       1524    .48
Household Size      39.2%         27.8%         33.0%         100.0%    35.1%       2404    .19
Marital Status      20.1%         73.1%         6.8%          100.0%    --          1241    .41
Education           53.2%         27.2%         19.6%         100.0%    24.7%       1275    .39
Age                 20.8%         67.0%         12.2%         100.0%    19.9%       1782    .73

Note: Far-off cases are counted when values from self-reported survey and ancillary data differ by more than one category (household income, household size, and education) or more than five years (for ages). Far-off cases were not computed for dichotomous variables. Ages within one year were considered equivalent.
N is the weighted overlap of non-missing cases between ancillary and self-report measures. All numbers are weighted by the best respondent match weight (see Online Appendix F for alternatives) ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 36 Table 3. Logistic Regressions Predicting Missing Ancillary Data with Self-Reports Missing Homeownership Homeowner Missing Household Income -1.27 *** (.16) Income - $15,000-24,999 Income - $25,000-34,999 Income - $35,000-49,999 Income - $50,000-74,999 Income - $75,000-99,999 Income - $100,000-149,999 Income - $150,000 or More .10 .13 -.26 -.33 -.61 -.72 -.52 (.24) (.25) (.27) (.25) (.27) (.37) (.46) .13 .53 + .04 -.04 -.20 -.14 -.23 2 Persons in Household 3 Persons in Household 4 Persons in Household 5 or More Persons in Household -.10 -.21 -.39 -.36 (.19) (.21) (.23) (.25) -.28 -.47 + -.81 * -1.70 *** Married -.25 (.17) Education - High School Degree Education - Some College Education - College Degree Education - Graduate Degree .10 -.14 -.07 -.06 Age Age2 -.03 .0001 Intercept .48 McFadden's Pseudo R2/R2 -2 Log Likelihood Percent Correctly Predicted (PCP) Null Percent Correctly Predicted N * + + Missing Marital Status .10 (.27) -.50 *** (.12) (.32) (.30) (.38) (.36) (.44) (.48) (.62) -.64 -.01 .25 .16 -.12 .26 .36 (.70) (.50) (.47) (.43) (.52) (.54) (.67) .03 .04 -.19 -.24 -.39 -.28 -.61 + (.21) (.20) (.18) (.18) (.24) (.29) (.34) (.24) (.28) (.33) (.47) -.18 .48 .22 -.77 (.35) (.35) (.41) (.58) -.18 -.32 -.46 -.86 (.14) * (.16) * (.19) *** (.22) -.22 (.25) -.20 (.32) -.49 (.19) (.23) (.29) (.47) .10 -.08 -.06 .22 (.26) (.30) (.43) (.55) .07 -.37 .17 .32 (.30) (.38) (.48) (.72) (.02) (.0003) -.01 -.0002 (.03) (.0004) -.05 .0005 (.04) (.0004) (.49) -.86 (.69) .17 1133.9 86.8% 86.9% 3199 -1.31 *** (.25) Missing Household Size .17 585.1 94.1% 94.1% 3199 -2.31 ** (.87) .03 586.7 96.5% 96.5% 3199 Missing Education Number of Variables Missing (.14) -.99 *** (.12) (.26) (.28) (.26) (.22) (.23) (.25) (.29) .15 .15 -.04 -.27 -.42 -.36 
-.10 (.22) (.19) (.23) (.18) (.23) (.29) (.27) .04 .13 -.02 -.08 -.14 + -.13 -.06 (.10) (.08) (.10) (.07) (.08) (.10) (.11) .19 -.06 -.30 -.10 (.16) (.19) (.21) (.22) -.20 -.48 -.46 -.56 (.15) (.17) (.18) (.20) -.07 -.18 ** -.28 *** -.38 *** (.06) (.07) (.08) (.08) *** (.14) .02 (.16) .02 (.12) -.10 + (.05) -.02 -.16 -.07 -.10 (.16) (.15) (.21) (.40) .02 -.11 .12 .08 (.15) (.17) (.22) (.35) -.42 -.32 -.09 -.27 (.13) (.14) (.18) (.41) -.05 -.12 * -.002 -.02 (.06) (.06) (.10) (.15) -.03 .0004 * (.02) (.0002) .01 .0000 (.02) (.0002) -.08 *** (.02) .0004 * (.0002) -.03 *** (.01) .0003*** (.0001) 2.75 2.47 *** (.17) .15 (.39) .06 1972.8 77.3% 77.4% 3199 -.04 Missing Age .06 .33 .34 .34 .38 + .21 .61 * -1.94 *** (.46) .01 1851.7 81.1% 81.1% 3199 + ** * ** ** * *** (.40) .13 1715.2 74.8% 71.5% 3199 -.53 *** (.06) .12 3199

Note: Standard errors in parentheses. OLS was used to predict the number of missing variables for each individual (Column 7). Number of missing variables ranged from 0 to 6. A negative binomial regression provided a poorer overall fit and was therefore not presented. All regressions were weighted using best respondent match weights. Ns reflect total number of non-zero weighted cases because weights were set to a mean of one for this analysis. All numbers reflect results after multiple imputation. + p < .10; * p < .05; ** p < .01; *** p < .001 two-tailed.

Figure 1. Missing Ancillary Data by Variables and Respondents (Best Respondent Match Weights)
[Figure: two bar-chart panels. (a) Missing Ancillary Data by Variable (N = 2498): Home Ownership 13.6, Household Income 7.8, Household Size 3.6, Marital Status 24.2, Education 21.2, Age 28.5 (percent missing). (b) Distribution of Missing Ancillary Data Across Respondents (N = 2498): bars for 0 through 6 missing variables; bar labels, in percent: 44.7, 33.6, 11.5, 5.4, 2.9, 1.8, .2.]

Figure 2. Missing Ancillary Data by Self-Reported Variable Values (Best Respondent Match Weights)

Online Appendix A.
Question Wordings for Household-Level Variables

Homeownership (Nominal; 2 categories)
Self-report. Respondents were asked: “Are your living quarters... Owned or being bought by you or someone in your household, rented for cash, or occupied without payment of cash rent.” Respondents who selected “Owned or being bought by you or someone in your household” were coded 1; all other answers were coded 0.
Ancillary. Ancillary homeownership data were categorized 1 for homeowners and 0 for non-owners.
Recoding. Survey respondents reported their homeownership in three categories: “owned or being bought by you or someone in your household”, “rented for cash”, and “occupied without payment of cash rent”. To facilitate comparisons with the ancillary data, the self-reported responses were recoded into two categories, “homeowners” and “non-owners”.

Household Income (Ordinal)
Self-report. Respondents were asked: “Was your total HOUSEHOLD income in the past 12 months ...” Respondents could choose: “Below $35,000”, “$35,000 or more”, or “Don’t Know”.
Respondents who selected “Below $35,000” were asked: “We would like to get a better estimate of your total HOUSEHOLD income in the past 12 months before taxes. Was it ...” Respondents could choose: “Less than $5,000”, “$5,000 to $7,499”, “$7,500 to $9,999”, “$10,000 to $12,499”, “$12,500 to $14,999”, “$15,000 to $19,999”, “$20,000 to $24,999”, “$25,000 to $29,999”, or “$30,000 to $34,999”.
Respondents who selected “$35,000 or more” were asked: “We would like to get a better estimate of your total HOUSEHOLD income in the past 12 months before taxes. Was it ...” Respondents could choose: “$35,000 to $39,999”, “$40,000 to $49,999”, “$50,000 to $59,999”, “$60,000 to $74,999”, “$75,000 to $84,999”, “$85,000 to $99,999”, “$100,000 to $124,999”, “$125,000 to $149,000”, “$150,000 to $174,999”, or “$175,000 or more”.
Responses to all three questions were recoded into eight categories: “Less than $15,000”, “$15,000 to $24,999”, “$25,000 to $34,999”, “$35,000 to $49,999”, “$50,000 to $74,999”, “$75,000 to $99,999”, “$100,000 to $149,999”, and “$150,000 or more”.
Ancillary. Ancillary income was coded into categories for “$1,000-$14,999”, “$15,000-$24,999”, “$25,000-$34,999”, “$35,000-$49,999”, “$50,000-$74,999”, “$75,000-$99,999”, “$100,000-$124,999”, “$125,000-$149,999”, “$150,000-$174,999”, “$175,000-$199,999”, “$200,000-$249,999”, and “$250,000+”. Ancillary income data were recoded into eight categories: “Less than $15,000”, “$15,000 to $24,999”, “$25,000 to $34,999”, “$35,000 to $49,999”, “$50,000 to $74,999”, “$75,000 to $99,999”, “$100,000 to $149,999”, and “$150,000 or more”.
Recoding. Responses to household income questions in both sources were recoded into eight categories: “Less than $15,000”, “$15,000 to $24,999”, “$25,000 to $34,999”, “$35,000 to $49,999”, “$50,000 to $74,999”, “$75,000 to $99,999”, “$100,000 to $149,999”, and “$150,000 or more”.

Number of Persons in Household (Ordinal)
Self-report. Respondents were asked: “Including yourself, how many people currently live in your household at least 50% of the time? Please remember to include babies or small children, include unrelated individuals (such as roommates), and also include those now away traveling, at school, or in a hospital.” Respondents could enter a number between 1 and 15. Responses indicating more than 5 household members were collapsed into the single category: “5 or more”.
Ancillary. Data were requested on the number of adults in each household, the presence of children in the household, and the number of children in the household. Households that were not listed as having children present were coded 0 for the number of children (N=19,732).
The numbers of adults and children in the household were summed to produce a variable for the total number of persons in the household. Sums indicating more than 5 individuals in the household were collapsed into the single category: “5 or more”.
Recoding. Number of persons in the household was coded to range from 1 to 5 in both datasets, with values greater than 5 recoded to equal 5.

Number of Children in Household (Ordinal)
Self-report. Respondents were asked: “Including yourself, how many people currently live in your household at least 50% of the time? Please remember to include babies or small children, include unrelated individuals (such as roommates), and also include those now away traveling, at school, or in a hospital.” Respondents could enter a number between 1 and 15. Responses indicating more than 5 household members were collapsed into the single category: “5 or more”.
Ancillary. Data were requested on the number of adults in each household, the presence of children in the household, and the number of children in the household. Households that were not listed as having children present were coded 0 for the number of children (N=19,732). The numbers of adults and children in the household were summed to produce a variable for the total number of persons in the household. Sums indicating more than 5 individuals in the household were collapsed into the single category: “5 or more”.

Presence of Telephone (Nominal; 2 categories)
Self-report. Respondents were asked: “Is there at least one telephone INSIDE your home that is currently working and is not a cell phone?” Respondents who selected “Yes” were coded 1; all other respondents were coded 0.
Ancillary. Phone number matches were requested for all households in the sample. Phone numbers were matched for 11,881 households and could not be matched for 13,119 households. Matched households were coded 1; all other households were coded 0.
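The household-size recoding described in this appendix can be illustrated with a short sketch. The Python fragment below is not the authors' code: the function and parameter names are hypothetical, and the treatment of missing counts is an assumption, since the appendix does not say how such cases were handled.

```python
from typing import Optional

# Values above 5 collapse into the single "5 or more" category.
MAX_HOUSEHOLD_SIZE = 5


def self_report_household_size(reported: int) -> int:
    """Top-code the self-reported household size (entered as 1 to 15) at 5."""
    return min(reported, MAX_HOUSEHOLD_SIZE)


def ancillary_household_size(num_adults: Optional[int],
                             children_present: bool,
                             num_children: Optional[int]) -> Optional[int]:
    """Sum adults and children from the ancillary file, top-coded at 5.

    Households not listed as having children present are coded 0 children,
    as described in the appendix. Returning None for missing counts is an
    assumption made for this sketch.
    """
    if num_adults is None:
        return None  # assumption: no adult count means no total
    if not children_present:
        children = 0
    elif num_children is None:
        return None  # assumption: children flagged but count unknown
    else:
        children = num_children
    return min(num_adults + children, MAX_HOUSEHOLD_SIZE)
```

Because both sources end up on the same 1 to 5 scale, the two recoded values can be compared directly, category by category.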
Online Appendix B. Question Wordings for Individual-Level Variables

Marital Status (Nominal; 2 categories)
Self-report. Respondents were asked: “Are you now married, widowed, divorced, separated, never married, or living with a partner?” Response options were “married”, “widowed”, “divorced”, “separated”, “never married”, and “living with partner”. Responses were recoded 1 for “married” and 0 for all others.
Ancillary. Data were requested on the marital status of an individual in each household. Ancillary marital status data were categorized 1 for married and 0 for single.
Recoding. Responses to marital status from both data sources were coded as 1 for respondents who reported that they were currently married and 0 for all other respondents. Marital status in the ancillary data was reported for individuals who were classified as “heads of household”. This category (as all others in the ancillary data) was not defined.

Education
Self-report. Respondents were asked: “What is the highest level of school you have completed?” Response options were “no formal education”, “first, second, third, or fourth grade”, “fifth or sixth grade”, “seventh or eighth grade”, “ninth grade”, “tenth grade”, “eleventh grade”, “twelfth grade no diploma”, “high school diploma or the equivalent”, “some college no degree”, “associate degree”, “bachelor degree”, “master degree”, and “professional or doctoral degree”.
Ancillary. Data were requested on the education level of an individual in each household. Ancillary education was coded into six categories for “less than high school diploma”, “high school diploma”, “some college”, “bachelor”, “graduate school”, and “Don’t know”.
Recoding. Education levels from both data sources were coded into 5 categories for “Less than High School”, “High School Graduate”, “Some College”, “College Graduate” and “Post-Graduate” education levels.

Age
Self-report.
Respondents could enter their age in an open-ended format.
Ancillary. Data were requested on an individual’s age in each household.
Recoding. Ages for all individuals were coded to range from 18 to 90 in analyses 1 and 2 and from 18 to 80 in analysis 3. To facilitate presentation, both data sources were also coded into 7 categories for individuals aged 18-24, 25-34, 35-44, 45-54, 55-64, 65-74, and 75 and older.

Online Appendix C. Weighting
Six sets of weights were produced to match the various data sources with the American public and to one another. These weights served two purposes: they adjusted for differences between individual-level self-reports and household-level ancillary data, and they corrected for the stratified sampling procedure used by GfK to select households.

Household level weight
To produce data that could be compared across sources, we needed to match self-reports to the values in ancillary data. The sampling procedure, however, allowed multiple individuals from a single household to enter the panel. This introduced three potential problems. First, the presence of multiple individuals from a single household introduced concerns about the independence of observations. Second, the results of our analyses might be biased toward households with multiple representatives. And third, ancillary data might correctly match one individual in a household while providing an inaccurate portrait of other individuals in the household. The first and second challenges are easily overcome by weighting observations at the household, rather than individual, level. The third challenge is more pernicious and requires that we consider the conditions under which household and individual data should be considered in agreement. To circumvent these problems, we created six sets of weights for respondents:
1.
Pure household weight: (1) the weight was coded as the inverse of the number of respondents in the household (Total unweighted N = 4472; weighted N = 2498).
2. Household adult weight: (1) individuals under 18 were dropped; (2) the weight was coded as the inverse of the number of adult respondents in the household (Total unweighted N = 4134; weighted N = 2400).[16]
3. Best respondent match weight: (1) we collected the ages of all respondents in each household; (2) the respondent closest to the age[17] indicated by the household ancillary data was selected and all other members of the household were dropped; (3) in households where no respondents were close to the age indicated by the ancillary data or where multiple respondents were equally close, individuals who were clearly not the best match were dropped and the weight was coded as the inverse of the number of remaining individuals in the household (Total unweighted N = 3199; weighted N = 2498).
4. Best household match weight generous: (1) respondents were asked to provide the names and ages of all individuals in the household; (2) in households where one individual was closest to the age indicated by the household ancillary data, all other members of the household were dropped (whether or not that individual was a respondent); (3) in households where multiple individuals were equally close or ancillary age information was missing, the weight was coded as the inverse of the number of remaining individuals in the household (Total unweighted N = 2010; weighted N = 1166).
5. Best household match weight strict: (1) respondents were asked to provide the names and ages of all individuals in the household; (2) in households where one individual was closest to the age indicated by the household ancillary data, all other members of the household were dropped (whether or not that individual was a respondent); (3) in households where multiple individuals were equally close, the weight was coded as the inverse of the number of remaining individuals in the household; (4) all individuals in households with more than one individual for which ancillary age information was missing were dropped (Total unweighted N = 920; weighted N = 909).
6. Sole household match weight: (1) respondents were asked to provide the names and ages of all individuals in the household; (2) in households where one and only one individual was within 5 years of the age indicated by the household ancillary data, all other members of the household were dropped (whether or not that individual was a respondent); (3) all individuals in households with more than one member (as defined by the delineation of household members) where none met this criterion were also dropped.[18] An equal weight of 1 was applied for all remaining households (Total unweighted N = 672; weighted N = 672).

[16] Ninety-eight households had only respondents under age 18 and were thus dropped from the dataset when these weights were used.
[17] Age was used in these circumstances because it was the most commonly available piece of ancillary information and was the only piece of ancillary information that could be consistently expected to discriminate between members of a household. Other variables were either household-level or would be expected to match multiple household members (e.g. marital status).

Probability of sampling correction
Probability of sampling corrections adjusted for deviation from an equal probability of selection across strata. As part of a procedure to increase the number of respondents in traditionally underrepresented groups, GfK used a stratified sampling technique. Households were categorized into four groups depending on the age and Hispanic status indicated in the ancillary data. Sampling probabilities were assigned to oversample households that included Hispanics or individuals ages 18-24.
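As an aside on the household-level weights enumerated above, the first and third can be sketched in code. The Python below is a simplified illustration under stated assumptions, not the authors' procedure: all names are hypothetical, "closest" is taken to mean the smallest absolute age difference, tied respondents share the household's weight equally, and households with no ancillary age keep all respondents. The appendix does not define how "close" a respondent had to be, so that part of rule (3) is not modeled.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def pure_household_weights(household_ids: Sequence[Hashable]) -> List[float]:
    """Weight each respondent by 1 / (respondents in the same household)."""
    counts: Dict[Hashable, int] = defaultdict(int)
    for hh in household_ids:
        counts[hh] += 1
    return [1.0 / counts[hh] for hh in household_ids]


def best_respondent_match_weights(
    respondents: Sequence[Tuple[Hashable, int]],
    ancillary_age: Dict[Hashable, Optional[int]],
) -> List[float]:
    """Keep the respondent(s) whose age is closest to the ancillary age.

    respondents: (household_id, self_reported_age) pairs.
    ancillary_age: household_id -> age in the ancillary file, or None.
    Returns one weight per respondent; 0.0 marks a dropped respondent.
    """
    by_household = defaultdict(list)
    for idx, (hh, age) in enumerate(respondents):
        by_household[hh].append((idx, age))

    weights = [0.0] * len(respondents)
    for hh, members in by_household.items():
        target = ancillary_age.get(hh)
        if target is None:
            # Assumption: with no ancillary age, nobody can be ruled out.
            keep = [idx for idx, _ in members]
        else:
            best = min(abs(age - target) for _, age in members)
            keep = [idx for idx, age in members if abs(age - target) == best]
        for idx in keep:
            # Inverse of the number of remaining individuals in the household.
            weights[idx] = 1.0 / len(keep)
    return weights
```

Under either function the weights within a household sum to one, which is consistent with the weighted N for these two schemes equaling the number of households (2498) rather than the number of respondents.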
Because our goal was to assess whether such techniques might improve the survey process, we needed to eliminate any biases that might have been introduced through this sampling procedure. To do so, we used two pieces of information: the proportion of households out of a random sample of one million with each of the ancillary demographic characteristics considered (population) and the proportion of respondents in each demographic category according to the ancillary data (respondents). Respondent-level weights were calculated to match the characteristics of respondents to those of the population (weight = population proportion / respondent proportion). Best respondent match weights were multiplied by respondent-level weights for all analyses presented. Alternate weighting strategies led to the same general conclusions and are shown in Online Appendix F.

[18] This left only individuals who were the sole member of their household within 5 years of the ancillary data age and individuals who were in households with only one member regardless of the match between ancillary and self-reported age. Although this amounts to selecting on a dependent variable, it served as a test of how sensitive the conclusions were to the member of the household chosen.

Online Appendix D.
Comparison of Self-Report and Ancillary Values

Table D1 – Crosstabs Comparing Self-Report and Ancillary Values
(Rows are self-report categories; columns are ancillary categories. "Difference" is the self-report category minus the ancillary category.)

Home Ownership
Self-Report     Ancillary: Non-Owner   Owner
Non-Owner       16.67                  6.78
Owner           4.18                   72.36
Difference:  -1: 6.78   0: 89.04   1: 4.18

Household Income
                Ancillary
Self-Report     <$15k      $15k-25k   $25k-35k   $35k-50k   $50k-75k   $75k-100k  $100k-150k >$150k
Less than $15k  1.55       2.41       2.06       2.54       2.31       .72        .37        .26
$15k-25k        .48        .83        1.73       1.47       2.71       1.00       .41        .17
$25k-35k        .65        .90        .29        1.97       2.38       1.18       .54        .26
$35k-50k        .88        1.01       2.00       3.35       4.80       2.50       2.37       .36
$50k-75k        .37        .39        1.46       2.61       6.61       4.72       3.78       1.53
$75k-100k       .14        .11        .40        2.02       2.89       3.65       4.06       .83
$100k-150k      .00        .28        .51        .81        2.66       3.05       3.92       1.66
More than $150k .09        .17        .00        .26        .16        .95        1.54       1.92
Difference:  -7: .26   -6: .54   -5: 1.39   -4: 4.21   -3: 10.32   -2: 13.02   -1: 21.36   0: 22.12   1: 13.46   2: 8.75   3: 2.64   4: 1.25   5: .42   6: .17   7: .09

Household Size
                Ancillary
Self-Report     1          2          3          4          5 or more
1               8.11       6.81       4.61       2.49       2.46
2               8.37       10.90      7.04       3.28       4.71
3               4.42       4.24       4.23       2.83       2.00
4               2.95       3.06       2.90       2.07       2.99
5 or more       1.21       2.05       1.89       1.87       2.50
Difference:  -4: 2.46   -3: 7.20   -2: 9.89   -1: 19.67   0: 27.82   1: 17.39   2: 9.37   3: 5.00   4: 1.21

Marital Status
Self-Report     Ancillary: Non-Married   Married
Non-Married     18.72                    20.11
Married         6.80                     54.38
Difference:  -1: 20.11   0: 73.09   1: 6.80

Education
                Ancillary
Self-Report     Less than HS  HS Grad    Some College  College Grad  Grad School
Less than HS    6.59          8.87       6.44          2.24          .84
HS Grad         4.46          8.14       11.64         3.54          .98
Some College    2.80          5.98       9.71          11.01         4.59
College Grad    .88           1.50       1.88          2.06          3.06
Grad School     .07           .13        .75           1.12          .73
Difference:  -4: .84   -3: 3.22   -2: 14.56   -1: 34.58   0: 27.23   1: 13.44   2: 5.05   3: 1.01   4: .07

Note: Matches fall on the main diagonal of each crosstab (bolded in the original). Numbers are percentages of pairwise complete wtd. N (see Table 2). All values weighted using best respondent weights.

Figure D1 – Comparisons of Self-Reported and Ancillary Age
[Figure: scatterplot titled “Comparing Age in Ancillary Data with Self-Reported Measures”. X-axis: Self-Reported Age; y-axis: Ancillary Data Age. Wtd. N = 1782; Wtd. Cor = .73.]
Points are jittered to show density. Dashed line indicates 5-year margin. Weighted using best respondent match weights.

Online Appendix E. Considerations for Using Consumer File Ancillary Data in Practice
Targeted data collection. The more accurate of the ancillary measures may be particularly useful for oversampling individuals in hard-to-reach and low-response-rate groups. Researchers adopting these strategies should carefully consider the bias-variance tradeoff of their choices, however. An additional step in weighting can lead to increases in variance when oversampling with the assistance of ancillary data (cf. Santos, 1991; Winship and Radbill, 1994). Also, researchers employing sample designs with an over-sampled stratum defined by ancillary information should always include an “all else” stratum to help mitigate bias and only establish class eligibility based on the self-reported information and not the ancillary information.
Because ancillary data measures do not perfectly mirror self-reports, proper weighting of an ancillary-generated oversample requires a multi-step process. First, researchers need to stratify their sample into oversampled and non-oversampled groups based on the ancillary data. Second, after sampling, data collected from the two samples need to be assigned a base weight such that the ancillary variables match their pre-stratification proportion of the sampling frame. This second step is necessary because some individuals in the target group (e.g. Hispanics) might not be captured by the oversample (due to inaccuracies in the ancillary data).
If these individuals represent a systematic type of respondent, post-stratification without this adjustment would overrepresent members of the target group who were correctly identified in the ancillary data and would under-represent members of the target group who were incorrectly identified in the ancillary data. Third, post-hoc correctives should be applied to the base weights to produce a sample that matches known population parameters (Deville et al., 1993).
Nonresponse adjustment. Effective nonresponse adjustment using consumer file data depends on the extent to which those data can discriminate between the individuals who do and do not respond to the survey. Technically, the value of auxiliary data for nonresponse adjustment does not depend on the accuracy of the ancillary data. Instead, it depends only on whether the auxiliary data can reliably distinguish between respondents and nonrespondents and on whether the relations between auxiliary and self-report data among respondents are identical to those among nonrespondents. Directly assessing this question is a job for future research, but the results here should concern those hoping to make such a correction. Inconsistent correspondence between consumer file data and self-reports is indicative of a variety of biases that could undermine corrective tools. Results indicating nonignorable missingness in particular imply that correctives may be differentially accurate depending on the actual levels of particular variables.

Online Appendix F. Analyses Using All Weighting Strategies
Table F1.
Descriptive Characteristics of Respondents for Six Demographic Variables and the Number of Cases Missing Data Table F1a – Unweighted, with Probability of Sampling Correction Only (Respondent N=4472) Self-Report Data Percent Home Owner 69.9% Non-Owner 30.1% Income < 15k 12.7% Income 15k – 25k Missing data (count) 1103 Ancillary Data (Respondents only) Missing data Percent (count) 594 79.4% 20.6% 1531 76.9% 326 4.9% 8.3% 9.1% 9.3% 11.2% 16.8% 15.6% 16.8% 21.3% 23.5% 23.5% Income 75k – 100k 13.3% 17.4% 14.9% Income 100k – 150k 12.9% 16.7% 13.5% 5.2% 7.1% 6.9% Income 35k – 50k Income 50k – 75k Income > 150k Household size – 1 14.9% Household size – 2 32.3% 26.9% 27.3% Household size – 3 20.7% 19.8% 16.6% Household size – 4 19.4% 13.3% 10.2% Household size > 4 12.7% 15.7% 12.0% Married 55.6% Not Married 44.4% Education - Less than HS 30.2% Education – HS 27.6% 25.9% 23.0% Education – Some College 30.6% 29.5% 28.7% Education - Bachelors 8.8% 17.6% 14.9% Education – Post Grad 2.8% 10.2% 8.8% Age – mean 43.3 years (sd=17.4y) 8 1811 24.2% 77.1% 167 986 22.9% 1811 8 16.7% 51.0 years (sd=13.7y) 4456 23.1% 6.6% Income 25k – 35k 8.6% 3.8% Ancillary Data (All sampled cases) Missing data Percent (count) 33.9% 71.1% 2591 2068 6903 28.9% 910 1272 24.6% 50.5 years (sd=16.2y) 5392 8660 ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 54 Table F1b – Pure Household Weight (N=2498) Self-Report Data Percent Home Owner 69.6% Non-Owner 30.4% Income < 15k 13.1% Missing data (count) 599 Ancillary Data (Respondents only) Missing data Percent (count) 325 79.4% 20.6% 850 3.8% Ancillary Data (All sampled cases) Missing data Percent (count) 76.9% 23.1% 187 4.9% Income 15k – 25k 9.3% 6.6% 8.3% Income 25k – 35k 9.1% 9.3% 11.2% Income 35k – 50k 17.2% 15.6% 16.8% Income 50k – 75k 20.7% 23.5% 23.5% Income 75k – 100k 13.5% 17.4% 14.9% Income 100k – 150k 12.2% 16.7% 13.5% 4.9% 7.1% 6.9% Income > 150k Household size – 1 24.6% Household size – 2 34.0% 26.9% 27.3% Household size – 3 18.1% 19.8% 16.6% 
Household size – 4 14.1% 13.3% 10.2% Household size > 4 9.3% 15.7% 12.0% 4 24.2% 85 33.9% Married 52.0% Not Married 48.0% Education - Less than HS 26.7% Education – HS 27.9% 25.9% 23.0% Education – Some College 32.3% 29.5% 28.7% Education - Bachelors 10.2% 17.6% 14.9% Education – Post Grad 3.0% 10.2% 8.8% Age – mean 46.8 years (sd=15.9y) 884 77.1% 585 22.9% 830 4 16.7% 51.9 years (sd=14.1y) 4456 71.1% 2591 2068 6903 28.9% 510 680 24.6% 51.5 years (sd=16.2y) 5392 8660 ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 55 Table F1c –Household Adult Weight (N=2400) Self-Report Data Percent Home Owner 69.9% Non-Owner 30.1% Income < 15k 13.0% Missing data (count) 1103 Ancillary Data (Respondents only) Missing data Percent (count) 594 79.3% 20.7% 1531 4.3% Ancillary Data (All sampled cases) Missing data Percent (count) 76.9% 23.1% 326 4.9% Income 15k – 25k 9.2% 6.6% 8.3% Income 25k – 35k 9.1% 9.1% 11.2% Income 35k – 50k 17.3% 15.3% 16.8% Income 50k – 75k 20.5% 24.9% 23.5% Income 75k – 100k 13.8% 16.5% 14.9% Income 100k – 150k 12.3% 16.2% 13.5% 4.8% 7.1% 6.9% Income > 150k Household size – 1 25.5% Household size – 2 34.8% 26.7% 27.3% Household size – 3 17.6% 20.6% 16.6% Household size – 4 13.3% 12.3% 10.2% Household size > 4 8.8% 14.7% 12.0% 8 25.8% 33.9% 53.6% Not Married 46.4% Education - Less than HS 24.5% Education – HS 28.7% 25.3% 23.0% Education – Some College 33.2% 29.4% 28.7% Education - Bachelors 10.4% 18.7% 14.9% Education – Post Grad 3.1% 10.3% 8.8% Age – mean 43.3 years (sd=17.4y) 74.9% 167 Married 1811 986 25.1% 1811 8 16.2% 51.0 years (sd=13.7y) 4456 71.1% 2591 2068 6903 28.9% 910 1272 24.6% 50.5 years (sd=16.2y) 5392 8660 ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 56 Table F1d – Best Respondent Match Weight (N=2498) Self-Report Data Percent Home Owner 69.6% Non-Owner 30.4% Income < 15k 13.1% Missing data (count) 626 Ancillary Data (Respondents only) Missing data Percent (count) 339 79.2% 20.8% 850 4.3% Ancillary Data (All sampled cases) Missing data Percent 
(count) 76.9% 23.1% 194 4.9% Income 15k – 25k 9.3% 6.6% 8.3% Income 25k – 35k 9.1% 9.2% 11.2% Income 35k – 50k 17.2% 15.4% 16.8% Income 50k – 75k 20.7% 24.7% 23.5% Income 75k – 100k 13.5% 16.6% 14.9% Income 100k – 150k 12.2% 16.1% 13.5% 4.9% 7.1% 6.9% Income > 150k Household size – 1 24.6% Household size – 2 34.0% 27.0% 27.3% Household size – 3 18.1% 20.6% 16.6% Household size – 4 14.1% 12.6% 10.2% Household size > 4 9.3% 14.6% 12.0% 4 25.1% 33.9% 53.6% Not Married 46.4% Education - Less than HS 24.7% Education – HS 28.8% 25.5% 23.0% Education – Some College 33.5% 29.3% 28.7% Education - Bachelors 10.1% 18.7% 14.9% Education – Post Grad 2.9% 10.1% 8.8% Age – mean 47.4 years (sd=15.7y) 75.1% 90 Married 866 604 24.9% 866 4 16.3% 51.7 years (sd=14.0y) 4456 71.1% 2591 2068 6903 28.9% 531 712 24.6% 50.5 years (sd=16.2y) 5392 8660 ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 57 Table F1e – Best Household Match Weight Generous (N=1166) Self-Report Data Percent Missing data (count) 314 Ancillary Data (Respondents only) Missing data Percent (count) 280 68.2% Ancillary Data (All sampled cases) Missing data Percent (count) Home Owner 56.4% Non-Owner 43.6% Income < 15k 17.5% Income 15k – 25k 10.4% 9.4% 8.3% Income 25k – 35k 12.3% 11.0% 11.2% Income 35k – 50k 17.2% 16.2% 16.8% Income 50k – 75k 17.3% 22.3% 23.5% Income 75k – 100k 11.1% 14.5% 14.9% Income 100k – 150k 9.8% 13.1% 13.5% Income > 150k 4.4% 6.0% 6.9% 31.8% 423 7.4% 76.9% 23.1% 190 4.9% Household size – 1 29.0% Household size – 2 33.7% 29.7% 27.3% Household size – 3 16.3% 20.6% 16.6% Household size – 4 13.3% 10.5% 10.2% Household size > 4 7.7% 11.4% 12.0% 2 27.8% 33.9% 44.6% Not Married 55.4% Education - Less than HS 27.9% Education – HS 27.3% 23.6% 23.0% Education – Some College 30.6% 29.6% 28.7% Education - Bachelors 11.2% 16.9% 14.9% Education – Post Grad 2.9% 7.5% 8.8% Age – mean 56.4% 44.1 years (sd=16.4y) 61.4% 48 Married 426 379 38.6% 426 2 22.5% 52.0 years (sd=14.7y) 4456 71.1% 2591 2068 6903 28.9% 309 688 
24.6% 50.5 years (sd=16.2y) 5392 8660 ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 58 Table F1f – Best Household Match Weight Strict (N=909) Self-Report Data Percent Home Owner 71.1% Non-Owner 28.9% Income < 15k 13.4% Missing data (count) 267 Ancillary Data (Respondents only) Missing data Percent (count) 108 79.3% 20.7% 323 4.8% Ancillary Data (All sampled cases) Missing data Percent (count) 76.9% 23.1% 64 4.9% Income 15k – 25k 7.4% 5.9% 8.3% Income 25k – 35k 9.8% 8.8% 11.2% Income 35k – 50k 17.7% 15.5% 16.8% Income 50k – 75k 20.2% 23.5% 23.5% Income 75k – 100k 14.0% 17.6% 14.9% Income 100k – 150k 12.4% 16.0% 13.5% 5.0% 7.9% 6.9% Income > 150k Household size – 1 37.4% Household size – 2 31.0% 26.2% 27.3% Household size – 3 12.0% 21.4% 16.6% Household size – 4 12.6% 12.4% 10.2% Household size > 4 7.0% 14.9% 12.0% 5 25.1% 33.9% 51.6% Not Married 48.4% Education - Less than HS 18.8% Education – HS 31.2% 25.4% 23.0% Education – Some College 34.3% 30.9% 28.7% Education - Bachelors 12.9% 20.1% 14.9% Education – Post Grad 2.8% 10.3% 8.8% 49.7 years (sd=15.2) 51.5 years (sd=14.5y) 76.9% 50.5 years (sd=16.2y) Age – mean 74.8% 38 Married 374 189 25.2% 374 5 13.4% 4456 71.1% 2591 2068 6903 28.9% 184 198 24.6% 5392 8660 ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 59 Table F1g – Sole Household Match Weight (N=672) Self-Report Data Percent Home Owner 64.1% Non-Owner 35.9% Income < 15k 17.0% Income 15k – 25k Missing data (count) 267 Ancillary Data (Respondents only) Missing data Percent (count) 100 76.2% 23.8% 292 9.3% 5.6% Ancillary Data (All sampled cases) Missing data Percent (count) 76.9% 23.1% 64 7.3% 4.9% 11.4% 11.0% 11.2% Income 35k – 50k 20.2% 16.8% 16.8% Income 50k – 75k 17.8% 24.4% 23.5% Income 75k – 100k 10.6% 14.6% 14.9% Income 100k – 150k 10.1% 13.5% 13.5% 3.6% 6.9% 6.9% Household size – 1 49.6% Household size – 2 27.3% 26.8% 27.3% Household size – 3 10.7% 21.5% 16.6% Household size – 4 8.7% 11.6% 10.2% Household size > 4 3.8% 12.4% 12.0% 5 27.7% 33.9% 27.1% Not 
Married 72.9% Education - Less than HS 18.8% Education – HS 31.1% 25.4% 23.0% Education – Some College 31.9% 30.1% 28.7% Education - Bachelors 14.7% 19.7% 14.9% Education – Post Grad 3.5% 10.7% 8.8% Age – mean 50.3 years (sd=15.5y) 68.5% 30 Married 376 175 31.5% 376 5 14.2% 52.9 years (sd=14.5y) 2591 8.3% Income 25k – 35k Income > 150k 4456 71.1% 2068 6903 28.9% 138 196 24.6% 50.5 years (sd=16.2y) 5392 8660

Table 2. Variable Value Comparisons between Survey and Ancillary Data (% of Respondents)

Table F2a – Unweighted, with Probability of Sampling Correction Only

| Variable | % Survey < Ancillary | % Survey = Ancillary | % Survey > Ancillary | Total | % Far-off cases | n | Corr. (r) |
|---|---|---|---|---|---|---|---|
| Homeownership | 6.5% | 89.5% | 4.0% | 100.0% | -- | 2925 | .70 |
| Household Income | 51.2% | 22.0% | 26.9% | 100.0% | 42.9% | 2734 | .51 |
| Household Size | 33.6% | 25.6% | 40.8% | 100.0% | 36.6% | 4297 | .18 |
| Marital Status | 23.3% | 69.4% | 7.2% | 100.0% | -- | 2107 | .32 |
| Education | 55.4% | 26.4% | 18.2% | 100.0% | 27.1% | 2104 | .36 |
| Age* | 38.9% | 46.0% | 15.1% | 100.0% | 37.5% | 3192 | .53 |

Table F2b – Pure Household Weight

| Variable | % Survey < Ancillary | % Survey = Ancillary | % Survey > Ancillary | Total | % Far-off cases | n | Corr. (r) |
|---|---|---|---|---|---|---|---|
| Homeownership | 6.8% | 89.0% | 4.2% | 100.0% | -- | 1621 | .68 |
| Household Income | 51.1% | 22.1% | 26.8% | 100.0% | 43.1% | 1524 | .48 |
| Household Size | 39.1% | 27.3% | 33.6% | 100.0% | 35.4% | 2406 | .18 |
| Marital Status | 22.7% | 70.7% | 6.6% | 100.0% | -- | 1230 | .38 |
| Education | 53.4% | 26.7% | 19.9% | 100.0% | 26.7% | 1261 | .35 |
| Age* | 33.9% | 50.4% | 15.7% | 100.0% | 32.5% | 1782 | .58 |

Table F2c – Household Adult Weight

| Variable | % Survey < Ancillary | % Survey = Ancillary | % Survey > Ancillary | Total | % Far-off cases | n | Corr. (r) |
|---|---|---|---|---|---|---|---|
| Homeownership | 6.7% | 89.0% | 4.3% | 100.0% | -- | 1559 | .68 |
| Household Income | 50.8% | 22.3% | 26.9% | 100.0% | 42.9% | 1465 | .48 |
| Household Size | 39.8% | 27.5% | 32.8% | 100.0% | 35.1% | 2311 | .18 |
| Marital Status | 20.9% | 72.3% | 6.8% | 100.0% | -- | 1193 | .40 |
| Education | 52.5% | 27.0% | 20.4% | 100.0% | 26.1% | 1229 | .36 |
| Age* | 31.4% | 52.4% | 16.3% | 100.0% | 29.9% | 1716 | .60 |

Table F2d – Best Respondent Match Weight

| Variable | % Survey < Ancillary | % Survey = Ancillary | % Survey > Ancillary | Total | % Far-off cases | n | Corr. (r) |
|---|---|---|---|---|---|---|---|
| Homeownership | 6.8% | 89.0% | 4.2% | 100.0% | -- | 1620 | .68 |
| Household Income | 51.1% | 22.1% | 26.8% | 100.0% | 43.1% | 1524 | .48 |
| Household Size | 39.2% | 27.8% | 33.0% | 100.0% | 35.1% | 2404 | .19 |
| Marital Status | 20.1% | 73.1% | 6.8% | 100.0% | -- | 1241 | .41 |
| Education | 53.2% | 27.2% | 19.6% | 100.0% | 24.7% | 1275 | .39 |
| Age* | 20.8% | 67.0% | 12.2% | 100.0% | 19.9% | 1782 | .73 |

Table F2e – Best Household Match Weight Generous

| Variable | % Survey < Ancillary | % Survey = Ancillary | % Survey > Ancillary | Total | % Far-off cases | n | Corr. (r) |
|---|---|---|---|---|---|---|---|
| Homeownership | 7.3% | 86.2% | 6.5% | 100.0% | -- | 651 | .69 |
| Household Income | 48.5% | 22.6% | 28.9% | 100.0% | 43.4% | 621 | .53 |
| Household Size | 38.5% | 27.2% | 34.2% | 100.0% | 35.4% | 1115 | .14 |
| Marital Status | 19.8% | 69.4% | 10.8% | 100.0% | -- | 499 | .39 |
| Education | 46.4% | 30.3% | 23.4% | 100.0% | 23.9% | 540 | .39 |
| Age* | 8.1% | 86.4% | 5.5% | 100.0% | 9.4% | 476 | .83 |

Table F2f – Best Household Match Weight Strict

| Variable | % Survey < Ancillary | % Survey = Ancillary | % Survey > Ancillary | Total | % Far-off cases | n | Corr. (r) |
|---|---|---|---|---|---|---|---|
| Homeownership | 4.7% | 92.4% | 2.9% | 100.0% | -- | 573 | .78 |
| Household Income | 47.9% | 22.5% | 29.6% | 100.0% | 42.7% | 549 | .49 |
| Household Size | 46.4% | 30.3% | 23.2% | 100.0% | 31.1% | 866 | .31 |
| Marital Status | 19.1% | 75.6% | 5.3% | 100.0% | -- | 423 | .49 |
| Education | 51.2% | 28.6% | 20.1% | 100.0% | 22.5% | 422 | .39 |
| Age* | 7.4% | 87.5% | 5.1% | 100.0% | 8.5% | 707 | .86 |

Table F2g – Sole Household Match Weight

| Variable | % Survey < Ancillary | % Survey = Ancillary | % Survey > Ancillary | Total | % Far-off cases | n | Corr. (r) |
|---|---|---|---|---|---|---|---|
| Homeownership | 5.6% | 90.0% | 4.4% | 100.0% | -- | 342 | .74 |
| Household Income | 48.2% | 21.4% | 30.4% | 100.0% | 46.7% | 344 | .45 |
| Household Size | 52.0% | 29.7% | 18.3% | 100.0% | 33.7% | 637 | .23 |
| Marital Status | 26.6% | 68.5% | 4.9% | 100.0% | -- | 198 | .43 |
| Education | 48.4% | 31.8% | 19.8% | 100.0% | 21.8% | 231 | .42 |
| Age* | 4.6% | 92.7% | 2.7% | 100.0% | 3.3% | 471 | .94 |

Note: Far-off cases are counted when values from self-reported survey and ancillary data differ by more than one category (household income, household size, and education) or more than five years (for ages). Far off cases were not computed for dichotomous variables. Ages within one year were considered equivalent. N is the weighted overlap of non-missing cases between ancillary and self-report measures.

Table 3. Logistic Regressions Predicting Missing Ancillary Data with Self-Reports

Table F3a – Unweighted, with Probability of Sampling Correction Only

Missing Home Ownership -1.37*** (.11) Missing Household Income -1.24*** (.17) Missing Household Size -.10 (.26) Missing Marital Status -.54*** (.09) -.06 (.17) .03 (.17) -.35* (.16) -.44+ (.23) -.49* (.22) -.78** (.24) -.53 (.38) .05 (.28) .55* (.25) .01 (.26) -.23 (.33) -.13 (.30) -.24 (.35) -.10 (.45) -.44 (.50) .36 (.37) .28 (.36) .40 (.36) .07 (.40) .39 (.44) .51 (.57) -.13 (.16) -.24 (.17) -.34+ (.18) -.17 (.19) -.36+ (.21) -.59* (.23) -.69** (.24) -1.65*** (.32) Married -.19 (.12) Education - High School Degree Self-Report Variable Home Owner Income - $15,000-24,999 Income - $25,000-34,999 Income - $35,000-49,999 Income - $50,000-74,999 Income - $75,000-99,999 Income - $100,000-149,999 Income - $150,000 or more 2 Persons in Household 3 Persons in Household 4 Persons in Household 5 Persons in Household Education - Some College Education - College Degree Education - Graduate Degree Age Age 2 Intercept McFadden's Pseudo R 2 / R 2 -2 Log Likelihoood Percent Correctly Predicted (PCP) Null Percent Correctly predicted N Missing Education Missing Age -.06 (.10) -1.15*** (.08)
Number of Missing Variables -.59*** (.04) .11 (.17) .05 (.17) -.08 (.15) -.23 (.15) -.34+ (.18) -.26 (.17) -.45 (.31) -.07 (.19) .27 (.18) .35+ (.19) .29+ (.16) .32+ (.19) .22 (.19) .59* (.26) .28+ (.16) .30+ (.16) .09 (.15) -.15 (.14) -.26 (.17) -.45 (.25) -.02 (.28) .05 (.08) .15+ (.08) .01 (.07) -.08 (.07) -.11 (.07) -.15+ (.08) -.02 (.13) -.22 (.31) .57+ (.30) .24 (.32) -.63 (.40) -.30* (.12) -.43** (.13) -.49*** (.14) -.98*** (.16) .15 (.14) -.12 (.15) -.34* (.16) -.27 (.17) -.19 (.13) -.46** (.14) -.26+ (.15) -.59*** (.15) -.11+ (.06) -.22*** (.06) -.25*** (.07) -.41*** (.07) -.10 (.20) -.17 (.24) -.42*** (.10) .02 (.12) .10 (.10) -.06 (.04) .04 (.14) -.13 (.19) -.11 (.23) -.39 (.46) .02 (.22) .08 (.24) -.11 (.39) .01 (.46) -.05 (.35) -.18 (.33) .002 (.40) .25 (.48) -.08 (.11) -.02 (.13) -.15 (.17) -.13 (.29) .01 (.13) -.09 (.12) .10 (.21) .04 (.30) -.34** (.10) -.22+ (.11) -.03 (.22) -.08 (.24) -.06 (.05) -.07 (.06) -.02 (.10) -.04 (.11) .01 (.02) -3e-04+ (2e-04) .02 (.02) -4e-04 (3e-04) -.005 (.02) 1e-04 (3e-04) -.01 (.01) 1e-04 (1e-04) -.004 (.01) 1e-04 (1e-04) .03* (.01) -.001*** (1e-04) .002 (.01) -1e-04 (1e-04) -.57 (.35) -1.70*** (.50) -3.36*** (.62) -.27 (.28) -1.63*** (.31) .23 (.29) 1.65*** (.13) .11 3027.4 .87 .87 4472 .09 1674.4 .95 .95 4472 .03 1328.2 .96 .96 4472 .05 4311.0 .80 .80 4472 .01 4149.6 .82 .82 4472 .11 4778.7 .74 .71 4472 .09 ---4472 ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 64 Table F3b – Pure Household Weight Missing Home Ownership Missing Household Income Missing Household Size Missing Marital Status Missing Education Missing Age -1.30*** (.15) -1.30*** (.22) -.01 (.29) -.53*** (.12) -.02 (.13) -1.07*** (.11) Number of Missing Variables (OLS) -.57*** (.05) .07 (.22) .10 (.24) -.36 (.24) -.36 (.27) -.57* (.29) -.70* (.34) -.27 (.39) .08 (.33) .52 (.34) .06 (.32) -.16 (.35) -.12 (.46) -.13 (.45) .12 (.60) -.56 (.74) .18 (.51) .33 (.47) .27 (.51) -.02 (.54) .32 (.55) .34 (.69) .02 (.22) -.08 (.21) -.20 (.20) -.22 (.18) -.39+ 
(.21) -.27 (.26) -.54 (.37) .01 (.26) .36 (.28) .31 (.21) .28 (.20) .28 (.25) .08 (.23) .65* (.28) .06 (.20) .16 (.22) -.14 (.23) -.35 (.22) -.43* (.20) -.49* (.24) -.20 (.27) .01 (.08) .11 (.08) -.06 (.09) -.11 (.09) -.17+ (.09) -.18+ (.10) -.04 (.12) -.14 (.19) -.27 (.21) -.45+ (.24) -.42 (.26) -.26 (.25) -.46 (.28) -.78* (.34) -1.64*** (.47) -.34 (.35) .39 (.36) -.10 (.42) -1.04+ (.60) -.16 (.14) -.32+ (.17) -.48* (.19) -.87*** (.22) .22 (.16) -.005 (.18) -.25 (.21) -.04 (.22) -.19 (.15) -.46** (.17) -.45* (.18) -.52** (.19) -.08 (.05) -.19** (.06) -.29*** (.07) -.38*** (.07) Married -.22 (.17) -.29 (.26) -.03 (.29) -.54*** (.13) -.03 (.14) .04 (.12) -.11* (.04) Education - High School Degree .01 (.18) -.28 (.19) -.19 (.29) -.26 (.45) .15 (.26) -.01 (.26) -.02 (.37) .32 (.50) .06 (.35) -.21 (.36) .10 (.51) .26 (.78) -.03 (.14) -.12 (.15) -.19 (.24) -.03 (.34) .12 (.14) -.01 (.14) .20 (.20) .11 (.33) -.36* (.15) -.26+ (.13) -.03 (.22) -.15 (.35) -.03 (.05) -.09+ (.05) -.01 (.08) .002 (.12) .004 (.02) -3e-04 (2e-04) .02 (.03) -.001 (4e-04) -.01 (.04) 1e-04 (4e-04) -.01 (.02) 2e-04 (2e-04) .001 (.02) 4e-05 (2e-04) .01 (.02) -4e-04* (2e-04) -.01 (.01) .00000 (1e-04) Intercept -.20 (.47) -1.66* (.68) -3.09*** (.85) -.08 (.38) -1.90*** (.41) .91* (.38) 1.88*** (.13) McFadden's Pseudo R 2 / R 2 -2 Log Likelihoood Percent Correctly Predicted (PCP) Null Percent Correctly predicted N .14 851.2 .87 .87 4472 .14 524.4 .94 .94 4472 .04 331.9 .97 .97 4472 .06 1338.7 .77 .77 4472 .01 1197.6 .81 .81 4472 .13 1261.7 .74 .71 4472 .10 ---4472 Self-Report Variable Home Owner Income - $15,000-24,999 Income - $25,000-34,999 Income - $35,000-49,999 Income - $50,000-74,999 Income - $75,000-99,999 Income - $100,000-149,999 Income - $150,000 or more 2 Persons in Household 3 Persons in Household 4 Persons in Household 5 Persons in Household Education - Some College Education - College Degree Education - Graduate Degree Age Age 2 ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 65 Table F3c 
–Household Adult Weight Missing Home Ownership Missing Household Income Missing Household Size Missing Marital Status Missing Education Missing Age -1.24*** (.15) -1.27*** (.23) .07 (.28) -.46*** (.13) .04 (.14) -1.05*** (.11) Number of Missing Variables (OLS) -.53*** (.05) .14 (.23) .17 (.25) -.28 (.26) -.28 (.28) -.52+ (.29) -.88* (.37) -.68 (.53) .06 (.33) .49 (.33) -.01 (.37) -.11 (.38) -.15 (.41) -.36 (.46) -.60 (.70) -.45 (.68) .06 (.56) .39 (.49) .39 (.42) -.05 (.57) .39 (.48) .44 (.62) .11 (.20) .09 (.20) -.11 (.20) -.13 (.21) -.33 (.23) -.33 (.27) -.64+ (.36) .05 (.24) .21 (.23) .25 (.23) .09 (.23) .16 (.25) -.04 (.25) .48 (.38) .06 (.22) .20 (.19) -.12 (.20) -.33 (.20) -.44+ (.22) -.47+ (.24) -.20 (.28) .04 (.08) .14+ (.08) -.04 (.08) -.10 (.09) -.17+ (.10) -.21* (.10) -.11 (.10) -.14 (.18) -.24 (.21) -.42+ (.24) -.40 (.26) -.29 (.24) -.47+ (.28) -.87* (.34) -1.66*** (.49) -.31 (.35) .46 (.35) -.09 (.43) -1.06+ (.62) -.17 (.14) -.28+ (.17) -.50** (.19) -.85*** (.23) .18 (.16) -.04 (.19) -.31 (.22) -.11 (.23) -.19 (.14) -.43* (.17) -.45* (.18) -.42* (.20) -.08 (.05) -.17** (.06) -.30*** (.07) -.37*** (.08) Married -.24 (.16) -.24 (.25) -.13 (.28) -.55*** (.14) .03 (.15) -.01 (.13) -.11* (.05) Education - High School Degree -.02 (.17) -.23 (.22) -.15 (.27) .02 (.40) .05 (.27) .001 (.31) -.01 (.38) .42 (.59) .02 (.32) -.29 (.36) .08 (.48) .27 (.68) -.07 (.14) -.13 (.14) -.16 (.21) -.08 (.34) .12 (.17) -.004 (.17) .24 (.21) .13 (.35) -.34* (.16) -.25 (.15) -.02 (.20) -.06 (.36) -.05 (.05) -.09 (.06) .004 (.08) .05 (.14) -.02 (.02) 2e-05 (3e-04) .01 (.04) -3e-04 (4e-04) -.03 (.04) 2e-04 (4e-04) -.03+ (.02) 4e-04* (2e-04) .01 (.02) -1e-05 (2e-04) -.01 (.02) -3e-04 (2e-04) -.02* (.01) 1e-04 (1e-04) Intercept .34 (.53) -1.22 (.78) -2.76** (.97) .30 (.42) -1.96*** (.47) 1.18** (.43) 2.10*** (.15) McFadden's Pseudo R 2 / R 2 -2 Log Likelihoood Percent Correctly Predicted (PCP) Null Percent Correctly predicted N .15 848.2 .87 .87 4134 .15 519.5 .94 .94 4134 .03 
335.5 .97 .97 4134 .06 1345.2 .77 .77 4134 .01 1202.4 .81 .81 4134 .14 1268.1 .74 .72 4134 .10 ---4134 Self-Report Variable Home Owner Income - $15,000-24,999 Income - $25,000-34,999 Income - $35,000-49,999 Income - $50,000-74,999 Income - $75,000-99,999 Income - $100,000-149,999 Income - $150,000 or more 2 Persons in Household 3 Persons in Household 4 Persons in Household 5 Persons in Household Education - Some College Education - College Degree Education - Graduate Degree Age Age 2 ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 66 Table F3d – Best Respondent Match Weight Missing Education Missing Age -.02 (.15) -.96*** (.11) Number of Missing Variables -.52*** (.05) .004 (.19) -.11 (.21) -.24 (.20) -.30+ (.18) -.44+ (.23) -.38 (.23) -.51+ (.30) .10 (.31) .39 (.32) .42+ (.21) .29 (.24) .36 (.32) .27 (.28) .64+ (.34) .13 (.19) .17 (.23) -.09 (.21) -.32+ (.18) -.48* (.22) -.51* (.23) -.19 (.28) .03 (.10) .10 (.10) -.05 (.08) -.14+ (.08) -.18* (.09) -.18* (.09) -.08 (.12) -.28 (.35) .38 (.36) .10 (.42) -.89 (.58) -.14 (.14) -.30+ (.17) -.43* (.19) -.82*** (.23) .21 (.16) -.04 (.18) -.27 (.21) -.07 (.22) -.19 (.15) -.48** (.17) -.46* (.18) -.56** (.20) -.07 (.06) -.18* (.07) -.28*** (.08) -.38*** (.08) -.20 (.25) .02 (.30) -.56*** (.15) -.01 (.13) .04 (.13) -.10+ (.05) .05 (.20) -.17 (.18) -.05 (.31) -.07 (.48) .08 (.27) -.05 (.28) .10 (.36) .34 (.56) -.03 (.36) -.30 (.37) .17 (.43) .04 (.92) -.07 (.14) -.15 (.14) -.12 (.21) -.09 (.35) .06 (.17) -.06 (.18) .16 (.20) .12 (.33) -.44** (.14) -.34* (.14) -.12 (.22) -.04 (.32) -.06 (.06) -.11+ (.06) -.003 (.09) .02 (.15) -.03 (.02) .001 (.003) -.01 (.03) -.001 (.004) -.05 (.04) .001 (.004) -.03 (.02) .004* (.002) .005 (.02) .0000 (.002) -.08*** (.02) .004* (.002) -.03*** (.01) .003*** (.001) .57 (.49) -.78 (.69) -2.08* (.88) .19 (.40) -1.99*** (.45) 2.77*** (.40) 2.49*** (.17) .16 1137.1 .87 .87 3199 .16 588.1 .94 .94 3199 .03 585.9 .96 .96 3199 .06 1970.3 .77 .77 3199 .02 1850.9 .81 .81 3199 .13 1712.8 .75 .71 3199 .12 
---3199 Missing Home Ownership -1.25*** (.15) Missing Household Income -1.28*** (.22) Missing Household Size -.04 (.33) Missing Marital Status -.47*** (.12) .13 (.26) .10 (.25) -.34 (.25) -.44+ (.26) -.55+ (.30) -.77+ (.39) -.62 (.48) .07 (.35) .52 (.33) .02 (.31) -.22 (.38) -.20 (.38) -.31 (.64) -.18 (.69) -.64 (.70) -.10 (.57) .17 (.45) .02 (.53) -.27 (.74) .11 (.65) .22 (.84) -.10 (.19) -.24 (.21) -.41+ (.23) -.38 (.25) -.28 (.24) -.48+ (.28) -.81* (.33) -1.68*** (.47) Married -.22 (.18) Education - High School Degree Self-Report Variable Home Owner Income - $15,000-24,999 Income - $25,000-34,999 Income - $35,000-49,999 Income - $50,000-74,999 Income - $75,000-99,999 Income - $100,000-149,999 Income - $150,000 or more 2 Persons in Household 3 Persons in Household 4 Persons in Household 5 Persons in Household Education - Some College Education - College Degree Education - Graduate Degree Age Age 2 Intercept McFadden's Pseudo R 2 / R 2 -2 Log Likelihoood Percent Correctly Predicted (PCP) Null Percent Correctly predicted N ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 67 Table F3e – Best Household Match Weight Generous Missing Home Ownership Missing Household Income Missing Household Size Missing Marital Status Missing Education Missing Age -1.00*** (.21) -1.03*** (.22) .33 (.35) -.51*** (.15) .05 (.16) -.99*** (.20) Number of Missing Variables (OLS) -.55*** (.09) .19 (.26) .17 (.26) -.16 (.28) -.11 (.29) -.24 (.33) -.53 (.42) -.16 (.50) .04 (.38) .45 (.30) -.01 (.36) .14 (.33) .17 (.42) -.15 (.50) .01 (.65) -.07 (.93) .33 (.67) .66 (.85) .09 (1.03) -.06 (.87) -.10 (.86) .47 (.88) .19 (.26) .16 (.29) -.11 (.23) -.01 (.30) -.19 (.30) -.24 (.34) -.19 (.57) -.08 (.35) .36 (.28) .65* (.28) .58* (.27) .55+ (.32) .36 (.35) .92* (.44) .49 (.32) .30 (.34) -.04 (.29) -.13 (.27) -.38 (.30) -.38 (.31) -.05 (.46) .15 (.15) .24* (.12) .09 (.11) .06 (.13) -.04 (.14) -.14 (.15) .14 (.22) -.10 (.22) -.11 (.25) -.47+ (.28) -.43 (.32) -.20 (.26) -.19 (.29) -.63+ (.34) -1.29** (.50) 
-.45 (.43) -.10 (.47) -.14 (.52) -1.99+ (1.08) -.18 (.19) -.35 (.23) -.43+ (.25) -.92** (.31) .11 (.20) .15 (.24) -.15 (.27) -.08 (.30) -.02 (.20) .07 (.23) -.11 (.25) .18 (.29) -.08 (.10) -.08 (.11) -.29* (.12) -.37** (.14) Married -.19 (.21) -.24 (.27) -.20 (.41) -.17 (.19) .04 (.19) -.02 (.17) -.08 (.09) Education - High School Degree .14 (.22) -.24 (.23) -.16 (.32) .30 (.44) .19 (.28) .08 (.27) .03 (.44) .57 (.52) .12 (.47) -.37 (.49) .30 (.55) -.30 (1.25) -.12 (.19) -.30 (.21) -.45 (.30) -.03 (.52) -.06 (.21) -.19 (.22) -.08 (.33) -.26 (.53) -.47* (.21) -.37+ (.21) -.19 (.28) .10 (.44) -.06 (.12) -.19+ (.10) -.13 (.14) .09 (.21) -.02 (.03) 4e-05 (3e-04) -.003 (.03) -2e-04 (4e-04) -.02 (.05) 1e-04 (.001) -.02 (.02) 3e-04 (2e-04) .003 (.02) 2e-05 (2e-04) -.11*** (.03) .001** (3e-04) -.03** (.01) 2e-04 (1e-04) Intercept .46 (.56) -.87 (.69) -2.49* (1.21) .28 (.50) -1.75** (.55) 4.57*** (.62) 2.87*** (.23) McFadden's Pseudo R 2 / R 2 -2 Log Likelihoood Percent Correctly Predicted (PCP) Null Percent Correctly predicted N .11 584.2 .78 .77 2010 .10 431.8 .87 .87 2010 .08 172.8 .96 .96 2010 .05 714.8 .71 .72 2010 .02 651.4 .76 .76 2010 .16 685.4 .72 .61 2010 .11 ---2010 Self-Report Variable Home Owner Income - $15,000-24,999 Income - $25,000-34,999 Income - $35,000-49,999 Income - $50,000-74,999 Income - $75,000-99,999 Income - $100,000-149,999 Income - $150,000 or more 2 Persons in Household 3 Persons in Household 4 Persons in Household 5 Persons in Household Education - Some College Education - College Degree Education - Graduate Degree Age Age 2 ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 68 Table F3f – Best Household Match Weight Strict Missing Home Ownership Missing Household Income Missing Household Size Missing Marital Status Missing Education Missing Age Home Owner -1.16** (.33) -.93+ (.49) .18 (.43) -.29 (.24) .10 (.27) -.95** (.34) Number of Missing Variables (OLS) -.31** (.09) Income - $15,000-24,999 .08 (.51) -.23 (.51) -.17 (.48) -.29 (.43) -.60 (.91) 
-.68 (.57) -3.14 (361.63) -.13 (.70) -.10 (.61) -.29 (.54) .08 (.67) -.66 (1.12) -.67 (.82) -3.81 (1541.83) -.55 (.86) -.44 (.94) .36 (.62) -.82 (1.10) .01 (.76) -.57 (.94) -.19 (1.38) .07 (.37) .02 (.35) -.35 (.38) -.08 (.31) -.30 (.43) -.09 (.52) -.40 (.94) -.06 (.52) .07 (.53) .50 (.44) .41 (.47) .53 (.47) .10 (.53) .63 (.60) -.05 (.56) -.41 (.53) -.61 (.46) -.39 (.62) -.80 (.69) -.67 (.57) -.98 (.98) .01 (.17) -.06 (.18) -.06 (.14) -.05 (.17) -.08 (.19) -.14 (.15) -.03 (.26) -1.61*** (.43) -1.84*** (.48) -2.26*** (.63) -.76 (.49) -18.76 (1347.02) -19.16 (1986.46) -19.10 (2046.11) -19.16 (2465.22) -.38 (.50) .51 (.48) -.01 (.60) -.72 (.83) -1.08*** (.26) -.89** (.31) -1.31*** (.38) -1.81*** (.51) -.28 (.25) -.29 (.30) -1.66*** (.46) -.92* (.43) -20.80 (996.14) -21.29 (1484.59) -21.21 (1509.35) -21.49 (1818.57) -1.05*** (.10) -1.09*** (.11) -1.29*** (.12) -1.24*** (.14) Married -.40 (.42) -.004 (.62) -.40 (.52) -.53+ (.28) .08 (.26) -.15 (.52) -.05 (.10) Education - High School Degree .41 (.39) -.31 (.46) -.11 (.49) .23 (.90) .12 (.54) -.13 (.61) -.05 (.73) .58 (.91) .06 (.54) -.40 (.65) -.35 (.88) -5.76 (511.67) -.12 (.31) -.47 (.34) -.62+ (.37) -.14 (.64) .19 (.32) .15 (.27) .24 (.36) .20 (.70) -.58 (.52) -.36 (.50) .08 (.53) -.22 (.95) .02 (.11) -.10 (.10) -.04 (.14) .05 (.22) -.06 (.04) 3e-04 (5e-04) .07 (.07) -.001 (.001) -.10+ (.05) .001+ (.001) -.03 (.03) 4e-04 (3e-04) .09** (.04) -.001* (4e-04) -.12* (.05) .001 (.001) -.01 (.01) 4e-05 (1e-04) Intercept 1.74+ (.99) -2.09 (1.52) -.27 (1.29) .45 (.78) -4.26*** (.95) 5.48*** (1.41) 2.20*** (.30) McFadden's Pseudo R 2 / R 2 -2 Log Likelihoood Percent Correctly Predicted (PCP) Null Percent Correctly predicted N .27 480.7 .89 .88 920 .35 254.2 .94 .94 920 .07 303.6 .96 .96 920 .12 781.4 .80 .80 920 .06 806.1 .81 .81 920 .61 355.5 .90 .80 920 .32 ---920 Self-Report Variable Income - $25,000-34,999 Income - $35,000-49,999 Income - $50,000-74,999 Income - $75,000-99,999 Income - $100,000-149,999 Income - $150,000 
or more 2 Persons in Household 3 Persons in Household 4 Persons in Household 5 Persons in Household Education - Some College Education - College Degree Education - Graduate Degree Age Age 2 ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 69 Table F3g – Sole Household Match Weight Missing Home Ownership Missing Household Income Missing Household Size Missing Marital Status Missing Education Missing Age -1.22*** (.32) -.99* (.43) -.003 (.62) -.31 (.28) .02 (.31) -1.04*** (.29) Number of Missing Variables (OLS) -.40** (.12) .18 (.56) .08 (.53) -.23 (.52) .17 (.59) -.31 (.80) -.54 (.89) .11 (1.37) -.10 (.65) .15 (.63) -.24 (.62) .24 (.71) -.48 (1.32) -.14 (1.00) -.04 (1.39) -.98 (1.08) -.20 (.92) .01 (.91) -.63 (1.06) -3.57 (347.76) -3.96 (579.55) -12.38 (1162.99) -.02 (.40) .02 (.43) -.38 (.41) .11 (.43) -.18 (.44) -.16 (.41) .19 (.78) .26 (.53) .39 (.59) .73 (.44) .74+ (.42) .62 (.54) .24 (.58) .65 (.65) .28 (.61) -.22 (.57) -.52 (.49) -.34 (.52) -.83 (.64) -.47 (.64) -.78 (1.61) .09 (.23) .07 (.18) -.04 (.18) .08 (.19) -.07 (.19) -.10 (.18) .04 (.32) -1.73*** (.52) -1.98*** (.57) -2.66* (1.08) -.13 (.61) -18.31 (1282.61) -18.88 (1847.27) -18.66 (2196.71) -18.58 (2820.00) -.70 (.64) .30 (.54) -.08 (.75) -.19 (.86) -.90** (.27) -.97** (.33) -1.65** (.51) -1.63* (.65) -.35 (.29) -.31 (.34) -1.57** (.58) -1.64* (.77) -20.72 (1243.63) -21.32 (1789.17) -21.11 (2099.07) -21.51 (2628.75) -1.00*** (.11) -1.11*** (.13) -1.29*** (.16) -1.15*** (.19) Married -.60 (.59) -.19 (.63) .35 (.56) -.34 (.30) .12 (.34) -.07 (.44) -.04 (.12) Education - High School Degree .44 (.39) -.40 (.41) -.23 (.50) .52 (.88) .14 (.50) -.38 (.67) -.10 (.67) .95 (.93) .53 (.91) .02 (.82) .42 (1.13) -2.74 (839.54) -.03 (.32) -.40 (.40) -.37 (.39) .39 (.86) .37 (.37) .21 (.42) .11 (.46) .29 (.75) -.27 (.46) -.37 (.44) -.01 (.64) .18 (.80) .10 (.14) -.12 (.15) -.05 (.20) .31 (.33) -.04 (.05) 2e-05 (.001) .07 (.07) -.001 (.001) -.15* (.06) .001* (.001) -.03 (.03) 4e-04 (3e-04) .09* (.04) -.001+ (4e-04) 
-.12* (.05) .001 (.001) -.02 (.01) 1e-04 (1e-04) Intercept 1.28 (1.08) -2.14 (1.50) .63 (1.40) .62 (.82) -4.64*** (1.10) 5.27*** (1.40) 2.43*** (.36) McFadden's Pseudo R 2 / R 2 -2 Log Likelihoood Percent Correctly Predicted (PCP) Null Percent Correctly predicted N .27 404.5 .87 .86 672 .30 254.5 .92 .92 672 .11 224.2 .95 .95 672 .09 674.3 .76 .76 672 .07 605.3 .81 .81 672 .55 355.8 .87 .72 672 .32 ---672 Self-Report Variable Home Owner Income - $15,000-24,999 Income - $25,000-34,999 Income - $35,000-49,999 Income - $50,000-74,999 Income - $75,000-99,999 Income - $100,000-149,999 Income - $150,000 or more 2 Persons in Household 3 Persons in Household 4 Persons in Household 5 Persons in Household Education - Some College Education - College Degree Education - Graduate Degree Age Age 2

Figure F1. Missing Ancillary Data by Variables and Respondents

[Figure: each version has two bar-chart panels. Panel a) Missing Ancillary Data by Variable (x-axis: Variable, with bars for Home Ownership, Household Income, Household Size, Marital Status, Education, and Age; y-axis: Proportion Missing Ancillary Data (%)). Panel b) Distribution of Missing Ancillary Data Across Respondents (x-axis: Number of Variables Missing, 0 to 6; y-axis: Proportion of Respondents (%)).]

Figure F1a – Unweighted, with Probability of Sampling Correction Only (N = 4472)
Figure F1b – Pure Household Weight (N = 2498)
Figure F1c – Household Adult Weight (N = 2400)
Figure F1d – Best Respondent Match Weight (N = 2498)
Figure F1e – Best Household Match Weight Generous (N = 1166)
Figure F1f – Best Household Match Weight Strict (N = 909)
Figure F1g – Sole Household Match Weight (N = 672)

Figure F2. Missing Ancillary Data by Self-Reported Value

[Figure: each version has six bar-chart panels, each plotting the proportion of cases missing the ancillary measure (%) against the self-reported category, with a chi-square test reported in the panel.
a) Distribution of Missing Home Ownership Data by Self-Reported Home Ownership (Non-Owner, Owner)
b) Distribution of Missing Household Income Data by Self-Reported Income Category (0–15, 15–25, 25–35, 35–50, 50–75, 75–100, 100–150, 150+)
c) Distribution of Missing Household Size Data by Self-Reported Household Size (1, 2, 3, 4, 5+)
d) Distribution of Missing Marital Status Data by Self-Reported Marital Status (Not Married, Married)
e) Distribution of Missing Education Data by Self-Reported Education Category (Less Than High School, High School Graduate, Some College, College Degree, Post-Graduate Education)
f) Distribution of Missing Age Data by Self-Reported Age Category (18–24, 25–34, 35–44, 45–54, 55–64, 65–74, 75+)]

Figure F2a – Unweighted, with Probability of Sampling Correction Only: a) χ²(1, 3369) = 501.7***; b) χ²(7, 2941) = 116.8***; c) χ²(4, 4464) = 28.1***; d) χ²(1, 2661) = 135.1***; e) χ²(4, 2661) = 3.1; f) χ²(6, 4136) = 281.1***

Figure F2b – Pure Household Weight: a) χ²(1, 1873) = 274.2***; b) χ²(7, 1648) = 56.7***; c) χ²(4, 2494) = 11.6*; d) χ²(1, 1614) = 99.9***; e) χ²(4, 1614) = 2.3; f) χ²(6, 2395) = 201.0***

Figure F2c – Household Adult Weight: a) χ²(1, 1801) = 261.1***; b) χ²(7, 1584) = 55.6***; c) χ²(4, 2396) = 11.9*; d) χ²(1, 1570) = 107.8***; e) χ²(4, 1570) = 2.6; f) χ²(6, 2396) = 201.6***

Figure F2d – Best Respondent Match Weight: a) χ²(1, 1872) = 273.8***; b) χ²(7, 1648) = 56.7***; c) χ²(4, 2494) = 11.7*; d) χ²(1, 1632) = 118.4***; e) χ²(4, 1632) = 3.9; f) χ²(6, 2458) = 292.2***

Figure F2e – Best Household Match Weight Generous: a) χ²(1, 852) = 119.4***; b) χ²(7, 743) = 27.1***; c) χ²(4, 1163) = 5.5; d) χ²(1, 740) = 24.0***; e) χ²(4, 740) = 1.1; f) χ²(6, 1132) = 153.1***

Figure F2f – Best Household Match Weight Strict: a) χ²(1, 642) = 109.5***; b) χ²(7, 586) = 22.8**; c) χ²(4, 904) = 8.3+; d) χ²(1, 535) = 44.7***; e) χ²(4, 535) = 2.6; f) χ²(6, 904) = 49.2***
College Degree Self−Reported Education Categor y Post−Graduate Education 0 10 16.8 35.0 20 21.6 Proportion Missing Ancillary Age Data (%) 50 53.7 50 40 30 20 20.7 27.9 10 Proportion Missing Ancillary Education Data (%) 26.2 0 Married Self−Reported Household Size 18−24 25−34 35−44 45−54 55−64 Self−Reported Age Category 65−74 75+ ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 79 Figure F2g – Sole Household Match Weight a) Distribution of Missing Home Ownership Data by Self−Reported Home Ownership b) Distribution of Missing Household Income Data by Self−Reported Income Category c (1, 405) = 78.5*** c (7, 380) = 13.7+ 70 60 50 40 30 36.6 17.1 17.1 16.9 13.8 9.1 10 20 30 36.6 20 Proportion Missing Ancillary Household Income Data (%) 60 50 40 2 10 Proportion Missing Ancillary Home Ownership Data (%) 70 2 6.4 3.5 3.0 3.2 75−100 100−150 0 0 .00000 Non−Owner 0−15 Owner 15−25 25−35 35−50 50−75 Self−Reported Home Ownership Self−Reported Income Categor y c) Distribution of Missing Household Size Data by Self−Reported Household Size d) Distribution of Missing Marital Status Data by Self−Reported Marital Status c (4, 667) = 7.8 c (1, 296) = 11.0*** 70 60 50 40 30 5.2 4.6 1.7 0 0 1.6 1 2 3 4 5+ Not Married Married Self−Reported Household Size Self−Reported Marital Status e) Distribution of Missing Education Data by Self−Reported Education Category f) Distribution of Missing Age Data by Self−Reported Age Category c2 (6, 667) = 59.3*** 70.7 20.2 22.4 50 40 22.4 18.4 10 17.3 Less Than High School High School Graduate Some College College Degree Self−Reported Education Categor y Post−Graduate Education 0 10 18.0 37.5 30 Proportion Missing Ancillary Age Data (%) 23.8 22.5 18.0 48.2 20 30 40 50 60 70 70.7 60 70 c2 (4, 296) = 1.0 24.7 20 18.3 10 5.2 Proportion Missing Ancillary Education Data (%) 38.7 20 20 30 40 50 Proportion Missing Ancillary Marital Status Data (%) 60 2 8.9 10 Proportion Missing Ancillary Household Size Data (%) 70 2 0 150+ 18−24 25−34 35−44 45−54 55−64 Self−Reported 
Age Category 65−74 75+ ACCURACY & COMPLETENESS IN CONSUMER FILE DATA 80
