ACS Public Use Microdata Samples of 2005 and 2006 – How to Use the Replicate Weights
B. Dale Garrett and Michael Starsinic
U.S. Census Bureau
AAPOR Conference, New Orleans
May 16, 2008

Public Data
• The American Community Survey (ACS) produces an annual Public Use Microdata Sample (PUMS) file.
• You can download these files for free.
• Write your own program to tally and analyze data.

Key Points
• PUMS data users want to know the reliability of an estimate.
• This paper explains how to use PUMS replicate weights to estimate standard errors.

Outline
• The American Community Survey (ACS)
• The Public Use Microdata Sample (PUMS)
  – Sample design
  – Confidentiality
  – Weights
  – Standard errors
  – Issues with standard errors

The American Community Survey
• The 2005 ACS
  – Sample of 250,000 housing units per month.
  – Every county represented in the fifty states, District of Columbia and Puerto Rico.
  – Collects population and housing characteristics.
• The 2006 ACS was similar but added
  – A sample of both institutional and noninstitutional Group Quarters (GQ) population.
  – GQ sample size was 16,000 persons per month.

PUMS Sample Design
• PUMS is a subsample of ACS
  – Sort the ACS interviews on geography, mode of interview, types of housing units, demographics.
  – Sample size:
    • One percent of the total HUs and HH persons in 2005 and 2006.
    • One percent of total GQ persons in 2006.
  – Systematic sampling at the state and PUMA level.

PUMA Definition
• PUMA - Public Use Microdata Area
  – Designed for public release of information by local state officials.
  – Large enough to achieve disclosure avoidance.
• An area of 100,000 population or more as of the 2000 Census.

PUMS Protects Confidentiality
• PUMS does not reveal:
  – Names of persons.
  – Address.
  – Detailed type of group quarters.
  – Geographic data below the PUMA level.
• The respondent’s identity is protected.
  – Top-coding of age, income and other variables.
  – Data swapping
  – Synthetic data
  – Perturbation of data

[Figure: Rural PUMAs in KY]

[Figure: PUMAs in Baltimore Co., MD]

PUMS Weighting
• The PUMS initial weight was equal to the ACS final weight times the sampling interval.
• The 2006 PUMS file was ratio-estimated to ACS
  – persons in households by sex by PUMA
  – housing units by vacant/occupied by PUMA
  – persons in group quarters by institutional/noninstitutional by state

How to Program an Estimate – Counts, Aggregates, Ratios, Medians
• Totals (counts)
  – Sum the PUMS weights (for the characteristic).
• Aggregates
  – Sum the product of the PUMS weight times the value.
• Ratios
  – Form the total or aggregate for the numerator.
  – Sum the PUMS weights for the characteristic in the denominator.
  – Divide.
• Medians
  – Use weighted distributions.

ACS Standard Errors
• The ACS uses the successive difference model of replicate weights to estimate standard errors.
• The successive difference model of Kirk Wolter was developed for ACS by Robert Fay and George Train.
  http://www.census.gov/hhes/www/saipe/asapaper/FayTrain95.pdf

Two Methods for PUMS Standard Errors
• Design factor method
  – Design factors are factors to multiply times the standard error of a simple random sample.
  – Easier to use than the replicate weights.
• Replicate weight method
  – Generally, you get a more accurate standard error estimate by using the replicate weights.
  – Somewhat more work than design factors.
  http://acsweb2.acs.census.gov/acs/www/Downloads/2006/AccuracyPUMS.pdf

Three Steps to Standard Errors Using Replicate Weights
• Write a program to derive an estimate using the PUMS weight.
• Run the program 80 more times using each of the 80 replicate weights.
• Use the PUMS estimate and the 80 replicate estimates in the Standard Error formula.

ACS PUMS Replicate Weight Formula for a Standard Error
$$SE = \sqrt{\frac{4}{80} \sum_{r=1}^{80} (X_r - X)^2}$$
• where:
  – $X$ is the estimate formed from the PUMS weight
  – $X_r$ is the estimate formed from the $r$th replicate weight.
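
The three steps and the replicate weight formula above can be sketched in Python. This is a minimal illustration, not the authors' program: the function names, the toy records, and the column naming (a full weight called `PWGTP` with replicate columns `PWGTP1` through `PWGTP80`) are assumptions made for the example, so check the PUMS documentation for the actual variable names in your file.

```python
import math

N_REPLICATES = 80  # the slides describe 80 replicate weights


def weighted_total(records, weight_key, has_characteristic):
    """Total (count): sum one weight column over records with the characteristic."""
    return sum(rec[weight_key] for rec in records if has_characteristic(rec))


def replicate_standard_error(estimate, replicate_estimates):
    """SE = sqrt((4/80) * sum over r of (X_r - X)^2)."""
    if len(replicate_estimates) != N_REPLICATES:
        raise ValueError("expected exactly 80 replicate estimates")
    sum_sq = sum((x_r - estimate) ** 2 for x_r in replicate_estimates)
    return math.sqrt(4.0 * sum_sq / N_REPLICATES)


def estimate_with_se(records, estimator, weight_name="PWGTP"):
    """Run `estimator` once with the full weight, then once per replicate weight.

    Assumes (hypothetically) that the replicate columns are named
    weight_name + "1" ... weight_name + "80".
    """
    estimate = estimator(records, weight_name)
    replicates = [
        estimator(records, f"{weight_name}{r}")
        for r in range(1, N_REPLICATES + 1)
    ]
    return estimate, replicate_standard_error(estimate, replicates)


def make_record(age, weight, bump):
    """Build a made-up record whose replicate weights all equal weight + bump."""
    rec = {"AGE": age, "PWGTP": weight}
    for r in range(1, N_REPLICATES + 1):
        rec[f"PWGTP{r}"] = weight + bump
    return rec


# Toy data only: two invented person records, not real PUMS values.
records = [make_record(70, 100, 1), make_record(30, 50, 0)]


def seniors_total(recs, weight_key):
    """Estimated count of persons aged 65 or older."""
    return weighted_total(recs, weight_key, lambda rec: rec["AGE"] >= 65)


total, se = estimate_with_se(records, seniors_total)
print(total, se)
```

Because `estimate_with_se` takes the estimator as a parameter, an aggregate or a ratio can be passed in the same way: the estimator only has to accept the records and the name of the weight column to use.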
Standard Errors of Differences
• There are two estimates, A and B.
• You want to use a Z-test to see if the difference (A – B) is significant.
• The Z-test requires the standard error of the difference.

For Independent Estimates
Use the standard errors of the two estimates to estimate the standard error of the difference.
$$SE_{A-B} = \sqrt{SE_A^2 + SE_B^2}$$
• $SE_{A-B}$ – the standard error of (A – B)
• $SE_A$ – the standard error of estimate A
• $SE_B$ – the standard error of estimate B

For Correlated Estimates
• Directly use the replicate weights to calculate the standard error of the difference.
  – Let $X = (A - B)$ = the difference
  – Let $X_r = (A_r - B_r)$
    • for the 80 replicate differences $X_1 \ldots X_{80}$
• Use the replicate weight formula (seen earlier).

Replicate Weight Issues
• Estimate is zero, standard error is not zero.
  – Cannot use replicate weights to estimate the standard error.
  – See the PUMS Accuracy document for a formula.
• The replicate standard error is zero, estimate is not zero.
  – Zero means that if you reselected the sample the answer would be the same.
  – Acceptable if estimate controlled in the weighting.
  – Not acceptable if the estimate is a median. Often a direct median gives a zero standard error.

Standard Error Options for Medians
• Direct median with replicate weights may give a zero standard error. This is not good.
• Categorical median with replicate weights will give a more stable standard error, but still some zero standard errors.
• Design factor method
  – Start with either the direct or categorical median, use design factors for the standard error.

Conclusion
• Replicate weights for ACS PUMS are:
  – Available for 2005 PUMS and later.
  – Easy to use for most estimates.
  – Few issues.
• For medians
  – Replicate weight standard errors may be zeros.
  – To avoid the zeros use the design factor method.
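
The two cases for the standard error of a difference (independent and correlated estimates) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and the inputs are assumed to be an estimate plus its 80 replicate estimates already computed with the replicate weights.

```python
import math

N_REPLICATES = 80  # number of replicate weights described in the slides


def se_difference_independent(se_a, se_b):
    """Independent estimates: SE(A - B) = sqrt(SE_A^2 + SE_B^2)."""
    return math.sqrt(se_a ** 2 + se_b ** 2)


def se_difference_correlated(a, b, a_replicates, b_replicates):
    """Correlated estimates: apply the replicate formula to X = A - B.

    X_r = A_r - B_r for each of the 80 replicates, then
    SE = sqrt((4/80) * sum of (X_r - X)^2).
    """
    if len(a_replicates) != N_REPLICATES or len(b_replicates) != N_REPLICATES:
        raise ValueError("expected exactly 80 replicate estimates for A and B")
    x = a - b
    sum_sq = sum(
        ((a_r - b_r) - x) ** 2 for a_r, b_r in zip(a_replicates, b_replicates)
    )
    return math.sqrt(4.0 * sum_sq / N_REPLICATES)


def z_statistic(a, b, se_diff):
    """Z = (A - B) / SE(A - B); undefined when the standard error is zero."""
    if se_diff == 0:
        raise ValueError("standard error of the difference is zero")
    return (a - b) / se_diff


# Made-up numbers for illustration only.
se_indep = se_difference_independent(3.0, 4.0)
se_corr = se_difference_correlated(10.0, 4.0, [11.0] * 80, [4.0] * 80)
z = z_statistic(10.0, 4.0, se_corr)
print(se_indep, se_corr, z)
```

The resulting Z value is then compared with the critical value for whatever significance level the analysis uses.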
References
• US Census Bureau: Accuracy of the Data (2006) for ACS is found at:
  – http://www.census.gov/acs/www/Downloads/ACS/accuracy2006.pdf
• US Census Bureau: PUMS Accuracy of the Data (2006) is found at:
  – http://acsweb2.acs.census.gov/acs/www/Downloads/2006/AccuracyPUMS.pdf
• US Census Bureau: Design and Methodology: American Community Survey, Technical Paper 67, May 2006
  – http://www.census.gov/acs/www/Downloads/tp67.pdf
• Fay & Train, Aspects of Survey and Model-Based Postcensal Estimation of Income and Poverty Characteristics for States and Counties, 1995
  – http://www.census.gov/hhes/www/saipe/asapaper/FayTrain95.pdf

Contact Information
• For questions about this presentation, or for an example program to generate standard errors, contact me at B.Dale.Garrett@census.gov

Views expressed in this paper are those of the authors and not necessarily those of the U.S. Census Bureau.

How to Derive an Estimate – Direct Medians
• The direct median is the weighted sample median or the distributional median.
• Sum the weights for the characteristic total.
• Sort the file on the value of interest.
• Sum the weights until the 50% point.
• The direct median is the value of the record which crosses the 50% point.
• Or, when two records divide the file into two exact halves, a point between the values of those two records.

How to Derive an Estimate – Categorical Medians
• Categorical or interpolated medians.
  – Used for published ACS statistics in Factfinder.
• Categorical medians are interpolations:
  – A weighted distribution of the characteristic.
  – Each bin or row is assigned a range of values.
  – Uses linear interpolation for most variables.
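
The direct-median steps listed above can be sketched as a short Python function. This is a minimal sketch, not the Census Bureau's program: the function name is hypothetical, weights are assumed to be positive, and taking the midpoint when two records split the file into exact halves is an assumption, since the slides only say "a point between" the two values.

```python
def direct_median(values, weights):
    """Weighted (direct) median: sort by value, sum weights to the 50% point."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal length")
    pairs = sorted(zip(values, weights))      # sort the file on the value
    total = sum(w for _, w in pairs)          # characteristic total
    half = total / 2.0                        # the 50% point
    running = 0.0
    for i, (value, weight) in enumerate(pairs):
        running += weight
        if running > half:
            # This record crosses the 50% point.
            return value
        if running == half and i + 1 < len(pairs):
            # Two records split the file into exact halves.
            # Assumption: use the midpoint of their values.
            return (value + pairs[i + 1][0]) / 2.0
    raise ValueError("could not locate the 50% point; check the weights")


# Made-up values and weights for illustration only.
print(direct_median([30, 10, 20], [1, 1, 1]))
print(direct_median([10, 20], [5, 5]))
```

Rerunning such a function with each replicate weight will often land on the same record, which is one way to see why a direct median can produce a zero replicate standard error.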
Direct Median Example Based on 5 Records

Record #   Percent of Total   Income from record
1          18%                18,000
2          22%                33,000
3          20%                41,000
4          15%                49,000
5          25%                62,000

Direct median: 41,000

Direct and Categorical Medians Example Based on 5 Records

Income Range          Record #   Percent of Total   Income from record
-59,000 to 20,000     1          18%                18,000
20,000 to 40,000      2          22%                33,000
40,000 to 60,000      3          20%                41,000
                      4          15%                49,000
60,000 +              5          25%                62,000

Direct median: 41,000
Categorical median: 45,700
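
The categorical median in the example above can be reproduced with a short linear-interpolation sketch. The function name and the bin representation are hypothetical, and the example's percent shares stand in for summed weights. Under those assumptions the interpolation gives about 45,714, which appears to be the 45,700 shown in the example after rounding.

```python
def categorical_median(bins):
    """Interpolated median from a weighted distribution.

    `bins` is a list of (lower, upper, weight) tuples in ascending order;
    `upper` may be None for an open-ended top bin. Linear interpolation is
    applied inside the bin that contains the 50% point.
    """
    total = sum(weight for _, _, weight in bins)
    if total <= 0:
        raise ValueError("total weight must be positive")
    half = total / 2.0
    cumulative = 0.0
    for lower, upper, weight in bins:
        if weight > 0 and cumulative + weight >= half:
            if upper is None:
                raise ValueError("50% point falls in the open-ended bin")
            fraction = (half - cumulative) / weight
            return lower + fraction * (upper - lower)
        cumulative += weight
    raise ValueError("could not locate the 50% point")


# Bins from the example; records 3 and 4 share the 40,000 to 60,000 bin,
# so that bin carries 20% + 15% = 35% of the total.
example_bins = [
    (-59000, 20000, 18),
    (20000, 40000, 22),
    (40000, 60000, 35),
    (60000, None, 25),
]

median = categorical_median(example_bins)
print(round(median, -2))
```

The arithmetic is 40,000 + (50 − 40) / 35 × 20,000: the bins below 40,000 hold 40% of the total, so the median sits 10/35 of the way through the 20,000-wide bin.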