Spatial Variation in the Quality of American Community Survey Estimates

David C. Folch∗, Florida State University
Daniel Arribas-Bel, University of Birmingham
Julia Koschinsky, Arizona State University
Seth E. Spielman, University of Colorado at Boulder

2014 Working Paper Number 01
WORKING PAPER, February 28, 2015

Abstract

Social scientific research, public and private sector decisions and allocations of federal resources often rely on data from the American Community Survey (ACS). However, this critical data source is plagued by high uncertainty in some of its most frequently used estimates. Using 2006–2010 ACS median household income estimates at the census tract scale as the test case, we explore spatial and non-spatial patterns in ACS estimate quality. We find spatial patterns of uncertainty in the northern U.S. to be different from the southern U.S., and to be different in suburbs from urban cores; in both cases uncertainty is lower in the former than the latter. Further, uncertainty is higher in areas with lower incomes. A series of multivariate spatial regression models is used to describe the patterns of association between uncertainty in estimates and economic, demographic and geographic factors, controlling for the number of responses. We find that these demographic and geographic patterns in estimate quality persist even after accounting for response level. The implication is that data quality varies in different places, making cross-sectional analysis both within and across regions less reliable. We also present advice for data users and potential solutions to the challenges identified.

Key Words: American Community Survey, data uncertainty, income estimates, margin of error, spatial analysis
JEL Classification: C81, R12.

∗ Corresponding author (dfolch@fsu.edu)

1 Introduction

The data produced by the U.S.
Bureau of the Census (BOC) in the American Community Survey (ACS) are crucial inputs to both social science research and public and private sector decisions. However, these uses of the data are complicated by the high margins of error (MOE) commonly found in the ACS estimates. High MOEs are especially common when the estimate reflects a small subset of the population or covers a small geographic area such as a census tract. For example, over 44 percent of the ACS census tract estimates of children in poverty have an MOE at least as large as the estimate itself.1 These high margins of error can blur our understanding of places; for example, in census tract 190602 in the Belmont Cragin neighborhood of Chicago, Illinois the number of children in poverty is somewhere between 9 and 965. There is indeed growing recognition of the challenges associated with using ACS estimates due to their high levels of uncertainty. This general recognition is the result of various academic articles (MacDonald, 2006; Salvo and Lobo, 2006; Citro and Kalton, 2007; Spielman et al., 2014; Bazuin and Fraser, 2013) and the BOC’s own efforts to publicize the uncertainty in the data (U.S. Census Bureau, 2009a). However, the nature and causes of the uncertainty in the ACS are not widely understood. As we demonstrate in this research, a major cause for concern is that the uncertainty generally does not follow a random pattern across attributes and space. This implies that there is systematic variation in the quality of ACS data—some types of places have higher uncertainty than others. This paper examines the patterns of uncertainty in median household income estimates from the 2006-2010 ACS at the census tract scale. This variable is of broad interest and has broad impacts. 
Median household income estimates are used to determine market demand for services in the public, private and nonprofit sectors, to make decisions about retail site location and to study income inequality, to name a few examples. We illustrate patterns in the quality of ACS median household income estimates. We then explain these patterns through a series of spatial regression models. These models identify key spatial and demographic determinants of the variation in the quality of ACS estimates. We conclude with a discussion of strategies for handling uncertainty in ACS estimates.

1 Children in poverty is defined as people under 18 living in a family whose income is below the poverty level and who are related to the householder by birth, marriage or adoption. In 32,007 out of the 72,539 census tracts (44.1 percent) in the contiguous US, the MOE is greater than or equal to the estimate.

2 Constructing ACS Estimates

Before 1940, the BOC collected decennial census data from every (or nearly every) person and household in the country using a single list of questions. Between 1940 and 2000, the BOC used a two-tiered data collection approach with a short list of questions presented to the entire country and a supplemental list asked of a sample of the respondents. This split in questions evolved into what was known as the “short form” and “long form.” Starting in 2005, the BOC shifted the long form questions to an annual data collection model called the American Community Survey, and confined future decennial censuses to short form questions only. Census data has always contained non-sampling errors based on myriad known and unknown data collection challenges; since 1940, all census data generated from the long form also has sampling error. While the long form questions remained largely consistent between the 2000 decennial census and the ACS, the data collection strategy changed dramatically.
The ACS contacts approximately 3.54 million households each year (3 million prior to 2011), which is enough surveys to make estimates for large cities and counties (more than 65,000 residents) every year. Locations with 20,000 or more residents receive estimates based on data aggregated over three years. The estimates for all other census geographies, no matter their population size, are built from five years of data. In principle this interlacing of geographic and temporal scales allows the BOC to manage estimate quality while releasing data annually.2 However, these structural changes in data collection were accompanied by a large reduction in the sample size. The median number of completed long form surveys per census tract in 2000 was 249, but for the 2006–2010 ACS it was only 123 (see Figure 1).

2 By 2009, the BOC had collected sufficient data to start distributing 1-year, 3-year and 5-year estimates each year.

Figure 1: Response (count) by Census Tract, Census 2000 Long Form and ACS 2006–2010. [Histogram; x-axis: Responding Housing Units by Census Tract; y-axis: Percent of Tracts; series: Decennial Census Long Form (2000) and American Community Survey (2006–2010), each with its median marked.]

The ACS provides over 1000 data tables for every geographic area in the country. These represent high level variables like total population, and various cross-tabulations that capture a wide assortment of subpopulations in the country. While the ACS can measure total population for most geographic areas with little uncertainty, the uncertainty for subpopulations can be high and vary widely from place to place. An established and commonly used measure of uncertainty is the coefficient of variation (CV), which is the standard error of an estimate divided by the estimate itself. It can be computed from published ACS data, and provides a relative measure of uncertainty, essentially the error measured as a percent of the estimate.
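As a minimal sketch of this calculation, the snippet below derives a CV from a published estimate and its margin of error. It assumes the MOE is published at the 90 percent confidence level, so that the standard error is the MOE divided by 1.645; the function name `cv_from_moe` and the example tract are hypothetical and are not part of any Census Bureau tool.

```python
Z_90 = 1.645  # assumed z-score: published MOEs taken to be at the 90 percent level


def cv_from_moe(estimate: float, moe: float, z: float = Z_90) -> float:
    """Return the coefficient of variation: standard error divided by the estimate."""
    if estimate == 0:
        raise ValueError("CV is undefined for an estimate of zero")
    standard_error = moe / z          # convert the margin of error to a standard error
    return standard_error / abs(estimate)


# Hypothetical tract: median household income of $50,000 with an MOE of $8,225
example_cv = cv_from_moe(50_000, 8_225)
print(round(example_cv, 3))
```

Because the CV is a ratio, it can be compared across tracts with very different income levels, which is what makes it convenient for the comparisons that follow.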
The first five box plots in Figure 2 highlight the distribution of census tract scale CV for estimates of the size of various population groups, and the latter three box plots show the same for median household income. The plots show striking differences in both the magnitude and range of uncertainty for different subsets of the overall data. Superimposed on each box plot is a point representing the median CV value for counties (diamonds) and census block groups (squares), respectively larger and smaller than census tracts. Seven of the eight plots show that larger places tend to have lower uncertainty and smaller places higher; however, there is no consistent pattern in where the points fall relative to the tract uncertainty distribution. The subpopulation of 65 and older female Hispanic residents represents a difficult to capture group at small geographic scales. There is no data reported for any block group, and many counties and tracts are also suppressed. The result is that the county and tract median values are nearly identical.

Figure 2: Coefficient of Variation Distribution for Selected Variables and Spatial Scales, ACS 2006–2010. [Box plots of the tract distribution with block group and county medians superimposed; y-axis: Coefficient of Variation (CV); POPULATION panels: All, Female, 65 and older, Hispanic, Fem Hisp 65+; MEDIAN HOUSEHOLD INCOME panels: All HH, 65 and older, Hispanic.] Note: Age and ethnic differences in median household income are based on the head of household.

The ACS is a survey, and as such is susceptible to the same challenges as any other estimates based on a sample of the population. The “rolling” sample strategy employed by the BOC forces designers and users of the ACS to confront the three-way tension between attribute, temporal and spatial resolution: given the fixed number of surveys collected each year, it is not possible to create an estimate that has detailed resolution along all three dimensions and has low uncertainty.
The ACS approach is to continuously collect surveys, and then tabulate those surveys into an enormous array of estimates. Using fewer completed surveys to build an estimate makes it more uncertain. Thus for census tracts the BOC bolsters the estimates for small areas by sacrificing temporal resolution, but as seen in Figure 2 even five years of data is not always enough to generate reliable estimates. The following section uses a fixed spatial resolution (census tracts) and fixed attribute (median household income) to show how ACS estimate quality varies from place to place.

3 Patterns in Estimate Quality

All ACS estimates are associated with some level of uncertainty. However, places with higher uncertainty are not distributed randomly; high uncertainty is not equally likely in all locations and for all demographic groups. Higher levels of uncertainty in median household income estimates exist in the South and Southwest of the United States, near city centers, and in places with lower median incomes (Figures 4, 6, 7 and 8). Interestingly, inner cities and low income households, the groups and areas typically of most interest to social science researchers, tend to be associated with the greatest uncertainty in median household income estimates. Median household income is a key indicator of socio-economic status and its uncertainty, as measured by the coefficient of variation, is the focus of our analysis. There is no clear cutoff for what qualifies as an “acceptable” CV. A comprehensive report on the ACS (Citro and Kalton, 2007) produced for the National Research Council (NRC) states that the maximum acceptable CV should be in the 0.10–0.12 range, while noting that “what constitutes an acceptable level of precision for a survey estimate depends on the uses to be made of the estimate” (p.67).
A white paper produced by the software company ESRI (ESRI, 2011) characterizes a CV below 0.12 as “high reliability”, 0.12–0.40 as “medium reliability” and anything above 0.40 as “low reliability.” Figure 3 presents the distribution of the CV of median household income for over 70,000 census tracts in the continental U.S. for the 2006–2010 ACS data. The median CV is 0.095, with a slightly higher average of 0.110 due to the long tail in the distribution. Using the high end of the NRC range (0.12), approximately one third (32.1 percent) of U.S. census tracts have too much uncertainty while two-thirds (67.9 percent) would be considered acceptable.3

3 These estimates are based on census tracts in the continental U.S. with outlier tracts removed (see Section 4.2).

Figure 3: Distribution of Median Household Income Coefficient of Variation, U.S. Census Tracts ACS 2006–2010. [Histogram; x-axis: Median Household Income Coefficient of Variation (CV); y-axis: Percent of Tracts; the median, the mean and the 0.12 threshold are marked.] Note: Figure truncated at CV of 0.75.

While the magnitude and distribution of median income CVs is concerning, it does not necessarily imply systematic patterns in the uncertainty. We explore the potential for spatial patterning in the CVs using the Moran’s I measure of local spatial autocorrelation (Anselin, 1995). This measure identifies locations on a map where abnormally high or low values are clustered together. It further provides a statistical test to determine if the cluster is significant.4 Figure 4 highlights census tracts in red that are part of statistically significant clusters of high CV, census tracts in blue are in low CV clusters and white tracts are not statistically significant. If there were no spatial pattern to the magnitude of CVs, i.e., if they followed a random spatial distribution, then the map would be mostly white indicating that tracts with high (or low) CVs do not cluster together.
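To make the mechanics of the local statistic concrete, the following is a simplified sketch rather than the authors' implementation. It assumes row-standardized weights (each neighbor of a tract counts equally), standardizes the CVs, and builds a pseudo p-value from random permutations in which the focal tract is held fixed; the function `local_morans_i` and the toy chain of ten observations are hypothetical.

```python
import numpy as np


def local_morans_i(values, neighbors, permutations=999, seed=0):
    """Local Moran's I with conditional-permutation pseudo p-values.

    values: 1-D sequence of observations (for example, tract CVs).
    neighbors: neighbors[i] lists the indices adjacent to observation i;
               every neighbor gets equal weight (row standardization assumed).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()              # standardized values
    stats = np.zeros(x.size)
    p_values = np.ones(x.size)
    for i in range(x.size):
        nbrs = np.asarray(neighbors[i], dtype=int)
        if nbrs.size == 0:
            continue                          # isolates keep I = 0 and p = 1
        stats[i] = z[i] * z[nbrs].mean()      # z_i times the mean of its neighbors
        others = np.delete(z, i)              # hold z_i fixed, reshuffle the rest
        sims = np.array([
            z[i] * rng.choice(others, size=nbrs.size, replace=False).mean()
            for _ in range(permutations)
        ])
        larger = int(np.sum(sims >= stats[i]))
        larger = min(larger, permutations - larger)   # fold to the nearer tail
        p_values[i] = (larger + 1) / (permutations + 1)
    return stats, p_values


# Toy chain: five low values followed by five high values
toy_values = [1, 1, 1, 1, 1, 5, 5, 5, 5, 5]
toy_neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)]
toy_stats, toy_p = local_morans_i(toy_values, toy_neighbors)
print(np.round(toy_stats, 2))
```

In this toy example the statistic is positive inside each run of similar values and zero at the boundary between them. A significant positive statistic combined with an above-average value corresponds to a high cluster, and with a below-average value to a low cluster.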
However, the map shows that the north of the country, particularly the Midwest, has concentrations of high quality (low uncertainty) estimates, denoted in blue, while red clusters, indicating low quality estimates (high uncertainty), are located more in the South and Southwest. Since census tracts average approximately 4,000 people, at this scale the map visually emphasizes low density parts of the country.

4 The statistic is computed for each census tract on the map. Statistical significance is based on 999 random permutations of the data, and a significance level of 0.05.

Figure 4: Concentrations of Median Household Income Coefficient of Variation, Census Tracts ACS 2006–2010. Note: Results based on the local Moran’s I statistic. Red tracts are in statistically significant high CV clusters, blue tracts are in clusters of low CVs and white tracts are not part of a statistically significant cluster.

Figure 5 zooms in further to present the CV distribution for nine metropolitan areas across the country. In Madison, Wisconsin, for example, approximately 75 percent of census tracts are below the national median uncertainty level; in contrast, nearly 75 percent of the New Orleans, Louisiana census tracts are above this level. Other metropolitan areas such as Chicago and San Diego have median levels nearly identical to the national median. This diversity in the magnitude and range of uncertainty across metropolitan areas can affect the reliability of cross sectional analyses. The spatial variation in attribute uncertainty is not simply a macro scale phenomenon that varies from region to region, but also manifests itself within regions. Figure 6 presents clusters of high and low CVs within the Chicago, Illinois metropolitan area.
The map highlights multiple high CV concentrations around central Chicago and one in central Gary, Indiana. The low CV areas are generally located in the exurban periphery of the region, with notable exceptions such as the village of Oak Lawn, Illinois to the southwest of downtown Chicago.

Figure 5: Distribution of Uncertainty on Median Household Income, Selected Metropolitan Area Census Tracts ACS 2006–2010. [Box plots; x-axis: Metropolitan Statistical Areas (Chicago, Madison, Minneapolis, Mobile, New Orleans, New York, St. Louis, San Antonio, San Diego); y-axis: CV of Median Household Income.] Note: Horizontal line represents U.S. median value of 0.095. Two outlier census tracts not shown: New York with CV of 2.73 and Chicago with CV of 1.41.

Figure 6: Concentrations of Median Household Income Coefficient of Variation, Chicago Metropolitan Area Census Tracts ACS 2006–2010. Note: Results based on the local Moran’s I statistic. Red tracts are in statistically significant high CV clusters, blue tracts are in clusters of low CVs and white tracts are not part of a statistically significant cluster.

The spatial pattern seen in Chicago is indicative of a broader pattern of higher CVs near city centers and lower CVs at city edges. We explore this pattern by grouping census tracts from the largest 150 metropolitan statistical areas (MSAs)5 based on relative distance from their city center.6 This approach allows us to standardize distances from regions of all sizes and shapes, e.g., places with a waterfront downtown like San Diego, to places with a centered downtown like Denver.7 Figure 7 summarizes the data from 150 MSAs, and shows that there is a steep decline in the CV as distance initially increases from urban cores, which eventually moderates and then begins increasing again when reaching the peripheries of the regions.

5 Based on the December 2009 definition of Core Based Statistical Areas.

6 We order the census tracts within an MSA by increasing distance from their city center, and then split this ordered list into 100 bins (percentiles). We repeat this process for the 150 largest MSAs, and then pool the tracts by percentile for all MSAs into a single set of 100 bins. When completed, the first bin has the urban core of all the MSAs and the 100th bin has the urban fringe of each MSA. The points in Figure 7 represent the median CV value from each bin.

7 Urban centers are identified using the U.S. Geological Survey’s Geographic Names Information System. The latitude-longitude marker for the first city listed in the MSA name is extracted from the database, and then distances are computed from each census tract centroid to the urban center.

Figure 7: Summary of Median Household Income Coefficient of Variation Based on Distance from Urban Center, Census Tracts for the Largest 150 Metropolitan Areas, ACS 2006–2010. [Scatter plot; x-axis: Distance to Urban Center (Percentile); y-axis: Median CV; the U.S. median CV is marked.]

In addition to a clear spatial structure, CVs of median household income in large U.S. metropolitan areas also display a pattern across income levels. Figure 8 is similar in design to Figure 7, except that we order tracts by increasing income within their MSA before splitting them into percentile bins. This allows us to control for inter-metropolitan variations in income.
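The binning procedure behind Figures 7 and 8 can be sketched as follows. This is an illustrative reading of the steps described above, not the authors' code: the function `pooled_bin_medians`, the record layout (dictionaries with `msa`, `cv` and an ordering field) and the rule for assigning ranks to bins are all assumptions, since the source does not say how ties or uneven bin sizes were handled.

```python
from collections import defaultdict
from statistics import median


def pooled_bin_medians(tracts, key, n_bins=100):
    """Rank tracts within each MSA by `key`, bin the ranks, pool bins across MSAs.

    tracts: iterable of dicts with 'msa', 'cv' and the ordering field `key`
            (for example distance to the urban center, or income).
    Returns {bin_number: median CV}; bin 1 holds the lowest `key` values of every MSA.
    """
    by_msa = defaultdict(list)
    for tract in tracts:
        by_msa[tract["msa"]].append(tract)

    pooled = defaultdict(list)
    for rows in by_msa.values():
        rows.sort(key=lambda t: t[key])                 # order within the MSA
        for rank, tract in enumerate(rows):
            bin_number = rank * n_bins // len(rows) + 1  # 1 .. n_bins
            pooled[bin_number].append(tract["cv"])
    return {b: median(cvs) for b, cvs in sorted(pooled.items())}


# Two hypothetical MSAs with four tracts each, split into two bins for brevity
toy_tracts = [
    {"msa": "A", "dist": 1, "cv": 0.2}, {"msa": "A", "dist": 2, "cv": 0.15},
    {"msa": "A", "dist": 3, "cv": 0.1}, {"msa": "A", "dist": 4, "cv": 0.08},
    {"msa": "B", "dist": 10, "cv": 0.3}, {"msa": "B", "dist": 20, "cv": 0.2},
    {"msa": "B", "dist": 30, "cv": 0.1}, {"msa": "B", "dist": 40, "cv": 0.1},
]
toy_medians = pooled_bin_medians(toy_tracts, key="dist", n_bins=2)
print(toy_medians)
```

Ranking within each MSA before pooling is what makes a tract 5 miles from the center of a small region comparable to a tract 25 miles from the center of a large one; passing an income field as `key` gives the ordering used for Figure 8.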
The results show that uncertainty in median household income declines as median household income increases. The similar patterns in Figures 7 and 8 are likely related, as some MSAs’ lower income residents live closer to the urban core. The increasing level of uncertainty at the urban periphery is likely caused by the diversity of exurban locations, which can range from wealthy suburban enclaves to lower income agricultural communities.

Figure 8: Variation in Uncertainty on Median Household Income Coefficient of Variation Based on Increasing Income, Largest 150 Metropolitan Areas Census Tracts ACS 2006–2010. [Scatter plot; x-axis: Household Income (Percentile); y-axis: Median CV; the U.S. median CV is marked.]

4 Multivariate Methodology and Data

The previous section shows that there are clear geographic and socio-economic patterns in the quality of ACS median household income estimates. In this section we introduce a multivariate model to examine the robustness of those findings while simultaneously controlling for other potential factors that might affect the quality of ACS estimates. Examples of these other correlates include measures of demographic diversity, ACS survey response level and population change. We first introduce a national scale spatial regression framework that accounts for spatial patterns in the data. Next, we extend this model to assess the stability of these results between the North and South, motivated by the North-South variation in uncertainty found in Figure 4. Finally, we subset the data to assess the relationship between distance to urban center and income uncertainty for tracts within metropolitan areas.
4.1 Model

The national model to assess the relationship between income uncertainty and response level, diversity, population change and other factors is specified by the following multivariate equation:

$$\log y_i = \alpha + \delta \log \text{hu\_respond}_i + \phi \log X_i + \beta \log H_i + \gamma \log T_i + u_i \tag{1}$$

where $y_i$ represents the median household income coefficient of variation in tract $i$. The median household income CV in tract $i$ is explained by ACS response level ($\text{hu\_respond}_i$), a set of five socio-demographic variables related to race and diversity in the tract ($X_i$), two characteristics of the housing environment indicating potential residential instability ($H_i$) and four indicators of population change and distance to urban core ($T_i$). A more detailed rationale for why these variables are included can be found in section 4.2. Table 1 summarizes the description of the variables and their sources. The equation takes a log-log specification that allows the 12 regression coefficients $\delta$, $\phi$, $\beta$ and $\gamma$ to be interpreted as elasticities, expressing the percentage change expected in $y_i$ given a one percent increase in the explanatory variable.

Given the fine-grained scale of census tracts, it is likely that some of the unobserved characteristics captured by the error term are spatially correlated, in which case the estimates lose precision (Anselin, 1988). To account for this form of spatial dependence, we assume a spatial auto-regressive error term of the following structure:

$$u_i = \lambda \sum_j w_{ij} u_j + \epsilon_i \tag{2}$$

where $w_{ij}$ is the $ij$th element of a spatial weights matrix $W$ that formally represents the spatial connectivity of tracts, and $\epsilon_i$ is an i.i.d. and well-behaved disturbance. All the results shown relate to a matrix built using the queen contiguity criterion, under which two observations are neighbors, and thus assigned a weight of one, if they share a border of any length, including a single point.
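As a rough illustration of the weights and error structure, the sketch below builds a queen contiguity matrix for cells on a regular grid (a stand-in for real tract polygons, which would require a GIS library), rescales each row to sum to one, and solves Equation 2 for $u$ in matrix form, $u = (I - \lambda W)^{-1}\epsilon$. The function names, the 3 by 3 grid and the value of $\lambda$ used in the demonstration are all illustrative assumptions.

```python
import numpy as np


def queen_weights_grid(n_rows, n_cols):
    """Binary queen contiguity matrix for the cells of a regular grid."""
    n = n_rows * n_cols
    w = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < n_rows and 0 <= cc < n_cols
                    if (dr, dc) != (0, 0) and inside:
                        w[i, rr * n_cols + cc] = 1.0   # shared edge or corner
    return w


def row_standardize(w):
    """Rescale each row to sum to one (assumes every unit has a neighbor)."""
    return w / w.sum(axis=1, keepdims=True)


def spatial_error(w, eps, lam):
    """Solve u = lam * W u + eps for u."""
    n = w.shape[0]
    return np.linalg.solve(np.eye(n) - lam * w, eps)


w_binary = queen_weights_grid(3, 3)
w_std = row_standardize(w_binary)
eps = np.random.default_rng(0).normal(size=9)   # simulated i.i.d. disturbances
u = spatial_error(w_std, eps, lam=0.16)         # lam chosen only for illustration
print(w_binary.sum(axis=1))                     # neighbor counts per cell
```

On the 3 by 3 grid the corner cells have three neighbors, the edge cells five and the center cell eight, which shows how corner contact counts under the queen criterion.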
This matrix is then standardized so every row sums to one, effectively converting $\sum_j w_{ij} u_j$ into the average value of $u$ in the surroundings of $i$. These models are estimated with a generalized method of moments, following the approach proposed by Arraiz et al. (2010), which suggests an estimator robust to spatial autocorrelation and heteroskedasticity.

We next explore regional variations in the correlates of income CV. To do so, we apply a so-called spatial regimes approach that determines whether factors associated with CV play a different role in the northern and southern parts of the country (Figure 9). In essence, the national model is estimated separately for the North and South (spatial regimes). The census tract remains the unit of analysis. The baseline model of Equation 1 is expanded with spatial regimes yielding the following:

$$\log y_{ir} = \alpha_r + \delta_r \log \text{hu\_respond}_{ir} + \phi_r \log X_{ir} + \beta_r \log H_{ir} + \gamma_r \log T_{ir} + u_{ir}, \tag{3}$$

where the subscript $r$ indicates membership in the North or South; otherwise the specification remains the same. This means we obtain a set of $(\alpha, \delta, \phi, \beta, \gamma)$ parameters for both the North and the South of the United States. Also, we subset the spatial weights matrix $W$ so only neighbors within the same region remain connected. This formulation is equivalent to running separate regressions for each group of observations under the same $r$ subscript. The advantage of this framework is that it allows for a significance test of the equality of estimated parameters between regions by means of the spatial Chow test (Anselin, 1990).

Figure 9: North-South Regimes

Finally, we examine the relationship between income uncertainty and distance to urban center. Since this relationship can only be assessed within metropolitan areas, we restrict this analysis to all census tracts in the 150 largest metropolitan statistical areas (MSAs) in the country. MSAs are defined by the U.S. Office of Management and Budget, and approximate a functional urban region.
Although MSAs do not cover the entire extent of the country, these 150 MSAs were home to approximately 84 percent of the U.S. population in 2010.

4.2 Data

The dependent variable is the coefficient of variation (CV) of median household income. Due to the complexity of the sampling and weighting processes used to compute ACS estimates, the BOC estimates standard errors (the numerator of the CV) using a method called successive differences replication. The successive differences replication approach recomputes all ACS estimates 80 times using different base weights on the completed surveys, and then uses all this information to compute the variability in the actual estimate (see U.S. Census Bureau, 2009b, Chapter 12 for more details). When considering potential determinants of uncertainty in ACS estimates we first consider the total number of surveys collected in the particular tract (hu_respond). More responses are expected to reduce the CV. This is the only variable in the model taken from the ACS. Unlike most ACS data, this variable is a count of completed surveys, not an estimate. We specifically do not include other ACS variables in the model because they are expected to be contaminated by the same problems we aim to understand. It is important that the variables used to explain CV in median household income are, as much as possible, unaffected by measurement error. As a result, the pool of potential explanatory variables is heavily constrained, so we turn to the 2010 decennial census and a 2010 restricted-use database from the U.S. Department of Housing and Urban Development (HUD). Since the ACS data are collected over five years, there is some degree of temporal mismatch; however, the extent to which this affects our results is limited since the variables used are rather persistent over time. The variables used and their sources are summarized in Table 1.
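The replication step described above can be sketched in a few lines. The sketch assumes the variance formula commonly given for this method, namely 4/R times the sum of squared deviations of the R replicate estimates from the full-sample estimate, with R = 80 for the ACS; readers should confirm the factor against the Census Bureau documentation cited above. The function name `sdr_standard_error` and the replicate values are hypothetical.

```python
import math


def sdr_standard_error(estimate, replicate_estimates):
    """Standard error from successive differences replication.

    Assumed formula: variance = (4 / R) * sum((replicate - estimate) ** 2),
    where R is the number of replicate estimates (80 for the ACS).
    """
    r = len(replicate_estimates)
    if r == 0:
        raise ValueError("at least one replicate estimate is required")
    squared = sum((rep - estimate) ** 2 for rep in replicate_estimates)
    return math.sqrt(4.0 / r * squared)


# Hypothetical tract: full-sample estimate of $50,000 and 80 replicate estimates
full_sample = 50_000.0
replicates = [51_000.0] * 40 + [49_000.0] * 40
se = sdr_standard_error(full_sample, replicates)
print(se, se / full_sample)   # standard error and the resulting CV
```

Dividing the resulting standard error by the estimate gives the CV used as the dependent variable.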
Our concern in these multivariate models relates to potential covariation of socio-economic and/or geographic attributes with the quality of income estimates. Ideally there would be no correlation, a result indicating that uncertainty at the census tract scale is not a systematic function of the place. We thus model characteristics of places along three dimensions: socio-demographics and diversity, housing and residential environment and tract structure. The first dimension considers residents’ demographic characteristics (race, ethnicity and diversity), along with a proxy for income in the form of the number of federally subsidized housing units (hud_total).8 Diversity is measured using the Simpson index (also known as the Herfindahl index), which takes higher values if the population is spread more evenly across groups and lower ones if the population is more concentrated in a single group within the tract. We measure diversity in resident age and race/ethnicity of residents, with the expectation that more diverse places will have higher income variability and thus higher mean CV since the population is not of uniform type. The second dimension captures variables on the types of housing units in the place (group quarters population and vacant units). The final dimension investigates the census tract as a unit of analysis by looking at variation in land area, population (with more populous tracts expected to have less uncertainty) and whether the tract population was stable over the previous decade. Stability is proxied by tract boundary changes: tracts are split when population growth pushes them over the maximum size threshold and merged when population decline places them below the minimum size threshold. The unit of analysis is always the census tract. Although there are high level parameters that constrain tract delineation (U.S. Census Bureau, 1994), exceptions do occur.
For this reason we only include census tracts with a population greater than 500 and with more than 200 housing units. In general this excludes anomalous tracts such as national parks, lakes, large prisons, large college dormitories, etc. We also exclude tracts that have a household income estimate, but no associated MOE; these generally occur when the median household income estimate is at the reporting bounds of $2,499 or $250,001. Combined, these steps remove 1,186 tracts. The geography is further constrained to the lower continental U.S., leaving 71,353 census tracts for the national analysis and 58,386 for the MSA analysis. The Northern region includes 33,093 tracts compared to 38,260 tracts in the South.

8 Federally subsidized housing programs include public housing (traditional and HOPE VI), multi-family housing (including housing for the elderly and disabled, Section 202 and 811), and vouchers (predominantly Housing Choice Vouchers for tenants). Address-level records were aggregated to the tract level.

Table 1: Description of Dependent and Independent Variables

| Variable | Description | Expected Sign | Source |
|---|---|---|---|
| Dependent Variable | | | |
| Income CV | Median household income CV | | 2006–10 ACS |
| hu_respond | Housing units responding to ACS | - | 2006–10 ACS |
| Socio-demographics (X) | | | |
| hud_total | Federally subsidized housing units | + | HUD |
| black_rt | African-American population rate | + | 2010 Census |
| hisp_rt | Hispanic population rate | + | 2010 Census |
| race_simp | Racial/ethnic diversity (Simpson) | + | 2010 Census |
| age_simp | Resident age diversity (Simpson) | + | 2010 Census |
| Housing and Residential Structure (H) | | | |
| group_pop | Population in group quarters | + | 2010 Census |
| vacant | Vacant housing units | + | 2010 Census |
| Tract Structure (T) | | | |
| area | Land area | | 2010 Census |
| tr_nochange* | 2000–10 tract boundary stability | | 2000–10 Census† |
| dist2center** | Distance to urban core | | 2010 Census‡ |
| population | Total residents | | 2010 Census |

All variables measured in levels unless noted in the table. Natural logarithms are taken of each variable before use in the econometric model.
Expected signs of + and - are reported for two variables in the Tract Structure block; the source layout does not indicate which rows they belong to.
* Dummy variable
** Only included in metropolitan area regressions
† Computed by authors based on physical census tract boundary changes between 2000 and 2010 reported in the 2010 Census Tract Relationship Files.
‡ Urban centers identified using the U.S. Geological Survey’s Geographic Names Information System. The latitude-longitude marker for the first city listed in the MSA name is extracted from the database, and then distances are computed from each census tract centroid to the urban center.

5 Findings

Generally, the models indicate statistically significant differences between the north and south (Chow tests9, Table 2). Investigating the individual variables within the model shows that most have the same directional effect on CV, but at different magnitudes. An exception is that increased land area leads to higher CV in the South and lower in the North. Racial diversity and changes to tract boundaries are both insignificant in the North but significant in the South. We further find that increases in response level and age diversity reduce CV more in the North than the South, and increases in percent African American increase CV more in the North than South.

9 The Chow test identifies if there is a statistical difference between the magnitude of regression coefficients in the north and south models.
Increases in total tract population have a greater impact on CV reduction in the South than the North.

Table 2: National and Regional Regression Results

| Variable | National Model | MSA Model | Regimes Model: North | Regimes Model: South | Chow Test |
|---|---|---|---|---|---|
| Constant | 1.4105*** | 1.5197*** | 1.418*** | 1.5811*** | ** |
| hu_respond | -0.495*** | -0.4649*** | -0.4732*** | -0.4279*** | *** |
| hud_total | 0.0338*** | 0.0334*** | 0.0356*** | 0.0347*** | |
| black_rt | 0.1384*** | 0.101*** | 0.1676*** | 0.0817*** | *** |
| hisp_rt | 0.2046*** | 0.2293*** | 0.2696*** | 0.2104*** | * |
| race_simp | -0.0189*** | -0.0367*** | -0.0111 | -0.0615*** | *** |
| age_simp | -0.1417*** | -0.1529*** | -0.2515*** | -0.0929*** | *** |
| group_pop | 0.0115*** | 0.0131*** | 0.0125*** | 0.0132*** | |
| vacant | 0.1167*** | 0.113*** | 0.1141*** | 0.1013*** | |
| area | 0.0068*** | -0.0009 | -0.011*** | 0.0216*** | *** |
| tr_nochange | 0.0272*** | 0.0233*** | 0.0042 | 0.029*** | *** |
| population | -0.1801*** | -0.1968*** | -0.1755*** | -0.2348*** | *** |
| dist2center | | -0.0076*** | | | |
| λ | 0.1597*** | 0.1426*** | 0.1102*** | 0.1449*** | |
| Pseudo R2 | 0.3209 | 0.3154 | 0.3839 | 0.2652 | |
| N | 71,353 | 58,386 | 33,093 | 38,260 | |

Notes: Dependent variable: median household income coefficient of variation.
All variables in logarithms, except dummy variable tr_nochange.
Significance: *0.10, **0.05, ***0.01.
Pseudo R2 is the squared correlation between the actual and predicted dependent variable.
The Chow Test column also reports one entry (**) in the group_pop/vacant block and two entries (***, ***) in the final block; the source layout does not indicate which rows they belong to.

Mirroring the results from Section 3, we find the proxy for low income neighborhoods (hud_total) is positively associated with income CV after controlling for the other relevant factors. A one percent increase in the number of HUD-assisted low-income rental units is associated with a 0.03–0.04 percent increase in CV. Moreover, this result remains stable across regions, as indicated by the insignificant Chow test (Table 2). Next, the finding from Figure 7 that uncertainty decreases with distance from city center also holds in a multivariate context. After controlling for 11 other determinants of income CV, a one percent increase in distance from the city center is associated with a small yet significant percent reduction in CV (-0.0076).
As mentioned, this result was only computed for the subset of tracts in metropolitan areas.

As expected, a key component of income CV relates to survey response. Of all determinants, higher response levels (hu respond) are most strongly, significantly and negatively related to CV across the country, in urban areas and in both regions (Table 2). In other words, more survey responses are associated with less uncertainty. For the country as a whole, a one percent increase in the number of responding housing units results in a half percent reduction (-0.495) in CV, which reinforces the premise that more raw data from which to build the estimates results in lower uncertainty in those estimates. The removal of the covariates in X, H and T from Equation 1 has little influence on the estimated impact of the response level. The relationship between response level and income CV is weakest in urban areas (-0.4649) and the South (-0.4279). The difference in this relationship between northern and southern areas is significant (p = 0.01) as indicated by the Chow test.

Although most variables had the expected impact on uncertainty as outlined in Table 1, we did find an inverse relationship between racial/ethnic diversity and income CV after holding a tract’s share of African-American and Hispanic residents constant. In other words, higher levels of racial and ethnic segregation (i.e., lower diversity) are associated with larger income CVs. This relationship is strongest in the southern region where a one percent increase in the Simpson index (indicating more diversity) results in a 0.06 percent decrease in income CV, compared to 0.01 in the North. The Chow test shows that this difference is highly significant (p = 0.01). We found the same pattern with age diversity. Given that the response level is controlled we had expected more homogeneous places to be easier to measure and thus have lower uncertainty.
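Because the variables enter the model in natural logarithms (Table 1), the coefficients in Table 2 can be read as elasticities. The following is a minimal sketch of that arithmetic, not part of the authors' code: the function name `pct_change_in_cv` is hypothetical, and the two coefficients are copied from Table 2.

```python
def pct_change_in_cv(beta: float, pct_change_in_x: float) -> float:
    """Exact percent change in CV implied by a log-log coefficient.

    With ln(CV) = ... + beta * ln(x), scaling x by (1 + p) scales
    CV by (1 + p) ** beta, holding the other covariates fixed.
    """
    ratio = (1.0 + pct_change_in_x / 100.0) ** beta
    return (ratio - 1.0) * 100.0


# Coefficients copied from Table 2 (national model and South regime).
HU_RESPOND_NATIONAL = -0.495
RACE_SIMP_SOUTH = -0.0615

if __name__ == "__main__":
    for label, beta in [("hu respond (national)", HU_RESPOND_NATIONAL),
                        ("race simp (South)", RACE_SIMP_SOUTH)]:
        small = pct_change_in_cv(beta, 1.0)    # close to beta itself
        large = pct_change_in_cv(beta, 100.0)  # doubling the covariate
        print(f"{label}: +1% -> {small:.3f}% CV, +100% -> {large:.1f}% CV")
```

For small changes the result is approximately the coefficient itself, which is the reading used in the text. For large changes the exact formula matters: under this model, doubling the number of responding housing units is associated with a reduction in CV of roughly 29 percent, not 49.5 percent.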
Population stability (tr nochange), measured by lack of change in tract boundaries from 2000 to 2010, also had counterintuitive results. If tract boundary changes indicate large population growth or decline, then we would expect this instability to lead to measurement challenges. However, we find that more stability leads to higher CV, except in the North where it is insignificant.

The physical land area of a census tract turned out to be the most unstable variable across the four models. Larger land areas reflect lower population densities, but the substantive implications of population density vary between the metro-only models and the national scale models. In the national scale models this variable equates to more “rural” places; in the metro models tract size reflects the character of the built environment. Tract size is positively associated with household income CV in the southern region and the nation overall, but is insignificant in the urban model and negative in the North. Hence the Chow test is highly significant (p = 0.01) for the difference between the North (0.011 percent decrease in CV) and the South (0.0216 percent increase).

The model for the northern region has the largest explanatory power (pseudo R2 of 0.3839), followed in order by that for the country overall, urban areas and the southern region. The parameter λ, alluding to the spatial dependence in the error terms, is highly significant in all models with the smallest magnitude in the northern region (0.1102). Although tests for multicollinearity (not reported) suggest its presence, a simulation experiment indicates negligible differences in the final estimates.10

6 Implications

As the previous sections show, the quality of median household income estimates from the ACS is correlated with both geographic location and the attributes of places. These results are robust even after controlling for the number of respondents in the tract, which is the strongest determinant of uncertainty.
The implications of this variation can be profound. It is not uncommon today for researchers across the social sciences to present a map of demographic data, and then draw inferences from that map. These maps can show clear patterns and lead to statements such as, “housing affordability is lower in the eastern part of the state than the western” or “inner-city rates are higher than suburban.” However, the unequal distribution of uncertainty across these places implies that our confidence in these conclusions should be tempered. Some areas of the map likely have highly reliable estimates, while others vary to the point of making the observed differences meaningless. In this context we discuss approaches to help ACS data users navigate this dilemma.

10 The simulations repeatedly dropped 10 percent of the observations and then reestimated the model. The distribution of parameter estimates across the simulations displayed only minor variations in the estimates.

The goal of integrating uncertainty into descriptive reporting of ACS estimates is to add value to tables, charts and maps through clear communication of the error associated with the estimates. The most straightforward example of this is adding a column of MOEs to a table of estimates or including a second map that shows the MOEs. However, these approaches may not be the most effective means of communicating uncertainty information to all audiences, especially those unfamiliar with the interpretation of survey data. It is possible that a coarser representation of the uncertainty could be used, such as a red-yellow-green light icon attached to each estimate in a table indicating how much caution should be used when interpreting the value. In a cartographic context, interactions of color intensity or hatching overlays could be used to create a single map that integrates the estimates and MOEs (see for example Sun and Wong, 2010; O’Brien and Cheshire, 2014; Griffin et al., 2014).
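The red-yellow-green idea can be sketched in a few lines. The example below is illustrative only: it recovers a standard error from a published ACS margin of error (which is reported at the 90 percent confidence level, hence the 1.645 divisor), computes a CV, and maps it to a flag. The function names, the 0.12 and 0.40 cut-offs, and the three tract rows are assumptions made for this sketch, not values recommended in this paper.

```python
Z_90 = 1.645  # ACS margins of error are published at the 90 percent confidence level


def coefficient_of_variation(estimate: float, moe: float) -> float:
    """CV = standard error / estimate, with the SE recovered from the 90% MOE."""
    if estimate <= 0:
        raise ValueError("CV is undefined for non-positive estimates")
    return (moe / Z_90) / estimate


def traffic_light(cv: float, green_max: float = 0.12, yellow_max: float = 0.40) -> str:
    """Map a CV to a coarse caution flag (cut-offs are illustrative assumptions)."""
    if cv <= green_max:
        return "green"
    if cv <= yellow_max:
        return "yellow"
    return "red"


if __name__ == "__main__":
    # Hypothetical tract-level median household income estimates and MOEs.
    rows = [("tract A", 62000, 4100),
            ("tract B", 31000, 9800),
            ("tract C", 18000, 14500)]
    for name, est, moe in rows:
        cv = coefficient_of_variation(est, moe)
        print(f"{name}: estimate={est}, MOE={moe}, CV={cv:.2f}, flag={traffic_light(cv)}")
```

The cut-offs are the design choice that matters most here; any real application would need thresholds chosen for the audience and the decision at hand.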
For additional discussions and approaches, readers are referred to a long line of research on visualizing geospatial uncertainty (e.g., MacEachren, 1992; MacEachren et al., 2005; Wong and Sun, 2013; Mason et al., 2014; Kinkeldey and Schiewe, 2014).

Correlations, regression coefficients and other forms of model output derived from ACS estimates are fraught with hidden statistical issues since the input data are measured with error. The primary issue is attenuation bias, which causes the magnitude of a statistic (e.g., correlation or regression coefficient) to be reduced when one or more variables are measured with error. Corrections for the correlation coefficient have existed for over a century (Spearman, 1904), but not without controversy (Muchinsky, 1996). In the regression context, the issue is more complex. Standard econometric theory assumes that explanatory variables are deterministic and measured without error. The presence of a variable with error in the design matrix has the potential to taint all the regression coefficients in unpredictable ways (Greene, 2003, chapter 5). The main problem then lies in the interpretation of regression coefficients, which are likely to be biased if the ACS error is large. Various errors-in-variables and instrumental-variables-type approaches have been considered as potential fixes for this problem (Bound et al., 2001). The spatial structure embedded in the ACS is another option to consider in order to ameliorate the data challenges (see Anselin and Lozano, 2008 for a discussion on the topic). When it comes to the dependent variable, consequences are less severe since the measurement mismatch is transferred to the error component of the regression specification, and this can then be appropriately modeled to account for its structure.
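Attenuation bias in the simplest setting can be illustrated with a small simulation. The sketch below assumes classical measurement error (independent of the true value, mean zero) on a single explanatory variable, which is a simplification: the error in ACS estimates varies systematically across places, and with several regressors the bias need not be a simple shrinkage. All names and parameter values are hypothetical.

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_attenuation(n=20000, true_slope=1.0, signal_sd=1.0, noise_sd=1.0, seed=0):
    """Return (slope using error-free x, slope using noisy x, expected attenuated slope)."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, signal_sd) for _ in range(n)]
    y = [true_slope * x + rng.gauss(0.0, 0.5) for x in x_true]
    # Observed x = true x + classical measurement error.
    x_obs = [x + rng.gauss(0.0, noise_sd) for x in x_true]
    # Reliability ratio: share of observed variance that is true signal.
    reliability = signal_sd ** 2 / (signal_sd ** 2 + noise_sd ** 2)
    return ols_slope(x_true, y), ols_slope(x_obs, y), true_slope * reliability


if __name__ == "__main__":
    clean, noisy, expected = simulate_attenuation()
    print(f"slope with error-free x: {clean:.3f}")
    print(f"slope with noisy x:      {noisy:.3f} (theory: {expected:.3f})")
```

With equal signal and noise variances the estimated slope shrinks toward roughly half of its true value, which is the reliability ratio in this one-regressor case.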
One strength of ACS data, as compared to other cases of error in measurement, is that we know the magnitude of the error on each estimate, and can leverage this information in the model specification. Future research along these lines could provide interesting insights for more robust inferences using survey data with systematically varying estimate quality.

In recent years the BOC has made a concerted effort to address the challenge of variation in the uncertainty through more sophisticated sampling approaches (Bruce and Robinson, 2009). When the ACS began, an area (a census block) was assigned to one of seven sample rate categories; subsequent research into this initial approach has resulted in an increase to 16 categories (Sommers and Hefter, 2010). However, budget constraints limit the BOC to 3.54 million surveys per year, making this a zero-sum game: if one area is oversampled another must be under-sampled. This means that the variation in uncertainty is reduced by allowing some places to have worse estimates and others to have better ones.

The problem of too few completed surveys upon which to construct reliable estimates can be addressed directly by the user. By combining estimates, either by attribute or geographic area, the user boosts the amount of raw data supporting new derived estimates. For example, a set of three age ranges built by collapsing a set of ten age ranges will nearly always be more reliable (U.S. Census Bureau, 2009a). In the income example, tracts could be classified by income group (e.g., “low” or “high” income) instead of by the continuous variable median income; misclassification will certainly happen for tracts near the threshold, but the problem of income uncertainty is reduced. Alternatively, geographic areas can be combined into custom regions.
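The gain from combining estimates can be made concrete for count variables. The sketch below uses the root-sum-of-squares approximation for the MOE of a sum given in Census Bureau user guidance (U.S. Census Bureau, 2009a); it ignores covariance between the component estimates and does not apply directly to medians such as median household income. The function names and the three input counts are hypothetical.

```python
import math

Z_90 = 1.645  # ACS margins of error are published at the 90 percent confidence level


def combine_counts(estimates, moes):
    """Sum count estimates and approximate the MOE of the sum.

    Uses MOE_sum = sqrt(sum of squared MOEs), an approximation that
    treats the component estimates as uncorrelated.
    """
    total = sum(estimates)
    moe_total = math.sqrt(sum(m * m for m in moes))
    return total, moe_total


def cv(estimate, moe):
    """Coefficient of variation implied by a 90 percent MOE."""
    return (moe / Z_90) / estimate


if __name__ == "__main__":
    # Hypothetical counts (e.g., children in poverty) for three adjacent tracts.
    estimates = [120, 200, 150]
    moes = [95, 110, 100]
    for e, m in zip(estimates, moes):
        print(f"tract: estimate={e}, MOE={m}, CV={cv(e, m):.2f}")
    total, moe_total = combine_counts(estimates, moes)
    print(f"combined: estimate={total}, MOE={moe_total:.1f}, CV={cv(total, moe_total):.2f}")
```

In this made-up example the combined CV is lower than the CV of any single tract, because the estimate grows by addition while the approximated MOE grows only with the square root of the summed squared MOEs.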
Folch and Spielman (2014) present an automated approach for geographic aggregation that joins census tracts into “regions” in such a way that estimates for the regions all have CVs below a user-defined threshold. These user-based approaches take the uncertainty trade-off out of the hands of the BOC and put it directly in the hands of the researcher: in order to achieve the desired reliability, the granularity of the data is sacrificed.

7 Conclusion

When we think of measurement error we generally hope that it is low and of the well-behaved type that follows a uniform random distribution across all observations. What this research has shown is that we should not make either of these assumptions when working with ACS data. The issue of higher levels of uncertainty for ACS estimates compared to the decennial census is widely recognized among researchers; however, the variation in this uncertainty is not as well understood. As this work has shown, uncertainty is not only clustered over space, but the characteristics of places can inform the magnitude of uncertainty. Using median household income at the census tract scale as an exemplar, we find higher uncertainty in urban cores relative to suburban areas, higher uncertainty in lower income areas and a differential pattern in uncertainty in the northern U.S. relative to the southern U.S. These patterns are robust when controlling for the number of respondents to the survey by location, and other correlates of uncertainty.

The implication for users is that seemingly straightforward interpretations of ACS estimates require examination of the underlying uncertainty in the estimates. While this is clearly a critical issue for researchers, it does not mean that ACS estimates have too much error at the tract level to be useful in research and decision making. Some places have good data, and to the extent that a research project is confined to these locations there is little to worry about.
For those conducting research with mixed uncertainty levels, we highlighted approaches that can reveal the uncertainty information in mapping and tabular formats. Users can also directly control the problem by grouping estimates by attribute or by grouping census tracts into regions, i.e., by reducing the granularity of the data.

Creative strategies are needed to both push down overall uncertainty in raw ACS estimates and reduce variation in this uncertainty. Published work by the BOC (e.g., Castro Jr. and Hefter, 2008; Sommers and Hefter, 2010) and actual changes in the ACS show that efforts continue to be made along both these fronts. A strength of the ACS data collection model is that methodological improvements can be made at any time, not needing to wait for a decadal cycle to repeat. This opens the door for innovative ideas from the research community to be integrated into the ACS in a reasonable time frame.

Acknowledgements

David C. Folch and Seth E. Spielman acknowledge financial support from the National Science Foundation (award number 1132008). Julia Koschinsky acknowledges funding from the National Institutes of Health under Award Number 2-R01CA126858. The authors are solely responsible for the accuracy of the statements and interpretations contained in this publication. Such interpretations do not necessarily reflect the views of the Government. This work used the Python Spatial Analysis Library (Rey and Anselin, 2007, www.pysal.org).

References

Anselin, L. (1988). Spatial econometrics. Kluwer Academic Publishers, Dordrecht.

Anselin, L. (1990). Spatial dependence and spatial structural instability in applied regression analysis. Journal of Regional Science, 30(2):185–207.

Anselin, L. (1995). Local indicators of spatial association—LISA. Geographical Analysis, 27(2):93–115.

Anselin, L. and Lozano, N. (2008). Errors in variables and spatial effects in hedonic house price models of ambient air quality. Empirical Economics, 34(5):5–34.
Arraiz, I., Drukker, D., Kelejian, H., and Prucha, I. (2010). A spatial Cliff-Ord-type model with heteroskedastic innovations: small and large sample results. Journal of Regional Science, 50(2):592–614.

Bazuin, J. T. and Fraser, J. C. (2013). How the ACS gets it wrong: The story of the American Community Survey and a small, inner city neighborhood. Applied Geography, 45(December):292–302.

Bound, J., Brown, C., and Mathiowetz, N. (2001). Measurement error in survey data. Handbook of econometrics, 5:3705–3843.

Bruce, A. and Robinson, J. G. (2009). Tract level planning database with census 2000 data. Technical report, U.S. Census Bureau.

Castro Jr., E. C. and Hefter, S. P. (2008). Redesigning the American Community Survey computer assisted personal interview sample. In Proceedings of the Survey Research Methods Section, American Statistical Association.

Citro, C. F. and Kalton, G. (2007). Using the American Community Survey: benefits and challenges. National Academies Press, Washington, D.C.

ESRI (2011). The American Community Survey. Technical report, ESRI.

Folch, D. C. and Spielman, S. E. (2014). Identifying regions based on flexible user-defined constraints. International Journal of Geographical Information Science, 28(1):164–184.

Greene, W. (2003). Econometric analysis. Prentice Hall, Upper Saddle River, NJ.

Griffin, A. L., Spielman, S. E., Jurjevich, J., Merrick, M., Nagle, N. N., and Folch, D. C. (2014). Supporting planners’ work with uncertain demographic data. In Proc. 23rd International Symposium on Distributed Computing (University of Technology, Austria, September 2014), Berlin, Germany. Springer.

Kinkeldey, C. and Schiewe, J., editors (2014). Expert interviews about the use of visually depicted uncertainty for analysis of remotely sensed land cover change, Proc. 23rd International Symposium on Distributed Computing (University of Technology, Austria, September 2014), Lecture Notes in Computer Science, Berlin, Germany. Springer.

MacDonald, H. (2006). The American Community Survey: Warmer (more current), but fuzzier (less precise) than the decennial census. Journal of the American Planning Association, 72(4):491–503.

MacEachren, A. M. (1992). Visualizing uncertain information. Cartographic Perspectives, Fall(13):10–19.

MacEachren, A. M., Robinson, A., Hopper, S., Gardner, S., Murray, R., Gahegan, M., and Hetzler, E. (2005). Visualizing geospatial information uncertainty: What we know and what we need to know. Cartography and Geographic Information Science, 32(3):139–160.

Mason, J., Retchless, D., and Klippel, A., editors (2014). Dimensions of Uncertainty: A Visual Classification of Geospatial Uncertainty Visualization Research, Proc. 23rd International Symposium on Distributed Computing (University of Technology, Austria, September 2014), Lecture Notes in Computer Science, Berlin, Germany. Springer.

Muchinsky, P. M. (1996). The correction for attenuation. Educational and Psychological Measurement, 56(1):63–75.

O’Brien, O. and Cheshire, J. (2014). Mapping geodemographic classification uncertainty an exploration of visual techniques using compositing operations. In Proc. 23rd International Symposium on Distributed Computing (University of Technology, Austria, September 2014), Berlin, Germany. Springer.

Rey, S. J. and Anselin, L. (2007). PySAL: A python library of spatial analytical methods. The Review of Regional Studies, 37(1):5–27.

Salvo, J. J. and Lobo, A. P. (2006). Moving from a decennial census to a continuous measurement survey: factors affecting nonresponse at the neighborhood level. Population Research and Policy Review, 25(3):225–241.

Sommers, D. and Hefter, S. P. (2010). American Community Survey sample stratification current and new methodology. Technical report, U.S. Census Bureau.

Spearman, C. (1904). The proof and measurement of association between two things. The American journal of psychology, 15(1):72–101.

Spielman, S. E., Folch, D. C., and Nagle, N. N. (2014). Patterns and causes of uncertainty in the American Community Survey. Applied Geography, 46:147–157.

Sun, M. and Wong, D. W. S. (2010). Incorporating data quality information in mapping American Community Survey data. Cartography and Geographic Information Science, 37(4):285–299.

U.S. Census Bureau (1994). Geographic areas reference manual. Technical report, U.S. Census Bureau.

U.S. Census Bureau (2009a). A Compass for Understanding and Using American Community Survey Data: What Researchers Need to Know. U.S. Government Printing Office, Washington, D.C.

U.S. Census Bureau (2009b). Design and Methodology: American Community Survey. U.S. Government Printing Office, Washington, D.C.

Wong, D. W. and Sun, M. (2013). Handling data quality information of survey data in GIS: A case of using the American Community Survey data. Spatial Demography, 1(1):3–16.