ACS Public Use Microdata Samples of 2005 and 2006 –
How to Use the Replicate Weights
B. Dale Garrett and Michael Starsinic
U.S. Census Bureau
AAPOR Conference, New Orleans
May 16, 2008
1
Public Data
• The American Community Survey (ACS)
produces an annual Public Use Microdata
Sample (PUMS) file.
• You can download these files for free.
• Write your own program to tally and analyze
data.
2
Key Points
• PUMS data users want to know the reliability
of an estimate.
• This paper explains how to use PUMS
replicate weights to estimate standard errors.
3
Outline
• the American Community Survey (ACS)
• the Public Use Microdata Sample (PUMS)
– sample design
– confidentiality
– weights
– standard errors
– issues with standard errors
4
The American Community
Survey
• The 2005 ACS
– Sample of 250,000 housing units per month.
– Every county represented in the fifty states,
District of Columbia and Puerto Rico.
– Collects population and housing characteristics
• The 2006 ACS was similar but added
– A sample of both institutional and noninstitutional
Group Quarters population.
– GQ sample size was 16,000 persons per month
5
PUMS Sample Design
• PUMS is a subsample of ACS
– Sort the ACS interviews on geography, mode of
interview, types of housing units, demographics
– Sample size:
• one percent of the total HUs and HH persons in 2005
and 2006.
• one percent of total GQ persons in 2006
– Systematic sampling at the state and PUMA level.
6
PUMA Definition
• PUMA - Public Use Microdata
Area
– Designed for public release of information by local
state officials.
– Large enough to achieve disclosure avoidance.
• An area of 100,000 population or more as of the 2000
Census.
7
PUMS Protects Confidentiality
• PUMS does not reveal:
– Names of persons.
– Address.
– Detailed type of group quarters.
– Geographic data below the PUMA level.
• The respondent’s identity is protected.
– Top-coding of age, income and other variables.
– Data swapping
– Synthetic data
– Perturbation of data
8
Rural PUMAs in KY
9
PUMAs in Baltimore Co., MD
10
PUMS Weighting
• The PUMS initial weight was equal to the
ACS final weight times the sampling interval.
• The 2006 PUMS file was ratio-estimated to
ACS
– persons in households by sex by PUMA
– housing units by vacant/occupied by PUMA
– persons in group quarters by institutional/
noninstitutional by state
11
How to Program an Estimate –
Counts, Aggregates, Ratios, Medians
• Totals (counts)
– Sum the PUMS weights (for the characteristic).
• Aggregates
– Sum the product of the PUMS weight times the
value
• Ratios
– Form the total or aggregate for the numerator
– Sum the PUMS weights for the characteristic in
the denominator
– Divide
• Medians – use weighted distributions
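The count, aggregate, and ratio rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: the toy records, the field names (`weight`, `income`, `employed`), and the helper functions are all hypothetical stand-ins for a real PUMS file and its weight column.

```python
# Toy person records. "weight" stands in for the PUMS weight;
# all field names here are illustrative, not actual PUMS variable names.
records = [
    {"weight": 100, "income": 40000, "employed": True},
    {"weight": 150, "income": 55000, "employed": True},
    {"weight": 120, "income": 18000, "employed": False},
]


def total(records, has_characteristic, weight="weight"):
    """Count: sum the weights of the records that have the characteristic."""
    return sum(r[weight] for r in records if has_characteristic(r))


def aggregate(records, value, has_characteristic=lambda r: True, weight="weight"):
    """Aggregate: sum of weight times value over the qualifying records."""
    return sum(r[weight] * value(r) for r in records if has_characteristic(r))


def ratio(numerator, denominator):
    """Ratio: a total or aggregate divided by a weighted total."""
    return numerator / denominator


employed = total(records, lambda r: r["employed"])           # weighted count
everyone = total(records, lambda r: True)                    # weighted population
agg_income = aggregate(records, lambda r: r["income"])       # weighted income sum
mean_income = ratio(agg_income, everyone)                    # aggregate / total
employment_rate = ratio(employed, everyone)                  # total / total
```

Passing the weight column name as a parameter is a deliberate choice: the same functions can then be rerun with each replicate weight when a standard error is needed.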
12
ACS Standard Errors
• The ACS uses the successive difference
model of replicate weights to estimate
standard errors.
• The successive difference model of Kirk
Wolter was developed for ACS by Robert Fay
and George Train.
http://www.census.gov/hhes/www/saipe/asapaper/FayTrain95.pdf
13
Two Methods for PUMS
Standard Errors
• Design factor method
– A design factor is a multiplier applied to the
standard error computed as if the data were a
simple random sample.
– Easier to use than the replicate weights
• Replicate weight method
– Generally, you get a more accurate standard error
estimate by using the replicate weights.
– Somewhat more work than design factors.
http://acsweb2.acs.census.gov/acs/www/Downloads/2006/AccuracyPUMS.pdf
14
Three Steps to Standard Errors
Using Replicate Weights
• Write a program to derive an estimate using
the PUMS weight.
• Run the program 80 more times using each
of the 80 replicate weights.
• Use the PUMS estimate and the 80 replicate
estimates in the Standard Error formula.
15
ACS PUMS Replicate Weight
Formula for a Standard Error
$$SE(X) = \sqrt{\frac{4}{80} \sum_{r=1}^{80} (X_r - X)^2}$$
• where:
– X is the estimate formed from the PUMS weight
– Xr is the estimate formed from the rth replicate
weight.
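The three steps and the formula can be sketched as follows. This is a minimal sketch under stated assumptions: it assumes person records held as dictionaries, with the full-sample weight in a column named `PWGTP` and the replicate weights in `PWGTP1` through `PWGTP80` (check the data dictionary for your file year), and the toy data and helper names are invented for illustration.

```python
import math

N_REPLICATES = 80


def replicate_standard_error(estimate, replicate_estimates):
    """SE = sqrt( (4/80) * sum over r of (X_r - X)^2 )."""
    if len(replicate_estimates) != N_REPLICATES:
        raise ValueError("expected exactly 80 replicate estimates")
    sum_sq = sum((x_r - estimate) ** 2 for x_r in replicate_estimates)
    return math.sqrt(4.0 * sum_sq / N_REPLICATES)


def estimate_and_se(records, estimator):
    """Run estimator(records, weight_column) once with the PUMS weight
    and 80 more times with the replicate weights, then apply the formula."""
    x = estimator(records, "PWGTP")
    reps = [estimator(records, f"PWGTP{r}") for r in range(1, N_REPLICATES + 1)]
    return x, replicate_standard_error(x, reps)


def make_toy_record(base, delta):
    """Hypothetical record whose replicate weights alternate base +/- delta."""
    rec = {"PWGTP": base}
    for r in range(1, N_REPLICATES + 1):
        rec[f"PWGTP{r}"] = base + (delta if r % 2 else -delta)
    return rec


def weighted_total(records, weight):
    """Step 1 estimator: a simple weighted count of all records."""
    return sum(rec[weight] for rec in records)


records = [make_toy_record(100, 10), make_toy_record(200, 0)]
x, se = estimate_and_se(records, weighted_total)
print(x, se)
```

In the toy data every replicate total differs from the full-sample total of 300 by exactly 10, so the formula reduces to sqrt(4/80 × 80 × 100) = 20.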
16
Standard Errors of Differences
• There are two estimates, A and B.
• You want to use a Z-test to see if the
difference (A – B) is significant.
• The Z-test requires the standard error of the
difference.
17
For Independent Estimates
Use the standard errors of the two estimates to
estimate the standard error of the difference.
$$SE(A - B) = \sqrt{SE_A^2 + SE_B^2}$$
• SE(A – B) – the standard error of (A – B)
• SE_A – the standard error of estimate A
• SE_B – the standard error of estimate B
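A short sketch of the independent-estimates case and the Z-test it feeds. The numbers are made up for illustration, and the critical value of 1.645 (a two-sided 90 percent confidence level) is an assumption of this example rather than something stated on the slide.

```python
import math


def se_of_difference_independent(se_a, se_b):
    """Standard error of (A - B) when A and B are independent."""
    return math.sqrt(se_a ** 2 + se_b ** 2)


def z_statistic(a, b, se_diff):
    """Z = (A - B) / SE(A - B)."""
    return (a - b) / se_diff


# Hypothetical estimates and standard errors.
a, se_a = 120.0, 3.0
b, se_b = 105.0, 4.0

se_diff = se_of_difference_independent(se_a, se_b)
z = z_statistic(a, b, se_diff)

CRITICAL_VALUE = 1.645  # assumed: two-sided test at the 90 percent level
significant = abs(z) > CRITICAL_VALUE
print(se_diff, z, significant)
```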
18
For Correlated Estimates
• Directly use the replicate weights to calculate
the standard error of the difference.
– Let X = (A - B) = the difference
– Let Xr = (Ar – Br )
• for the 80 replicate differences X1 … X80
• Use the replicate weight formula (seen earlier).
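For correlated estimates, the same replicate formula is applied to the differences themselves. The sketch below assumes you already have A, B, and their 80 replicate estimates as plain numbers; the function names and the toy replicate values are hypothetical.

```python
import math


def replicate_standard_error(estimate, replicate_estimates):
    """SE = sqrt( (4/80) * sum over r of (X_r - X)^2 )."""
    if len(replicate_estimates) != 80:
        raise ValueError("expected exactly 80 replicate estimates")
    sum_sq = sum((x_r - estimate) ** 2 for x_r in replicate_estimates)
    return math.sqrt(4.0 * sum_sq / 80)


def se_of_difference_correlated(a, b, a_reps, b_reps):
    """Form X = A - B and X_r = A_r - B_r, then apply the replicate formula."""
    x = a - b
    x_reps = [a_r - b_r for a_r, b_r in zip(a_reps, b_reps)]
    return replicate_standard_error(x, x_reps)


# Toy replicates that move together: A swings by +/-10, B by +/-5.
a, b = 500, 300
a_reps = [a + (10 if r % 2 else -10) for r in range(1, 81)]
b_reps = [b + (5 if r % 2 else -5) for r in range(1, 81)]

se_direct = se_of_difference_correlated(a, b, a_reps, b_reps)
se_if_independent = math.sqrt(
    replicate_standard_error(a, a_reps) ** 2
    + replicate_standard_error(b, b_reps) ** 2
)
print(se_direct, se_if_independent)
```

Because the toy replicates for A and B rise and fall together, the direct calculation gives a smaller standard error than the independent-estimates formula would, which is why the direct approach matters when the estimates are positively correlated.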
19
Replicate Weight Issues
• Estimate is zero, standard error is not zero.
– Cannot use replicate weights to estimate the
standard error.
– See the PUMS Accuracy document for a formula.
• The replicate standard error is zero, estimate
is not zero.
– Zero means that if you reselected the sample the
answer would be the same.
– Acceptable if estimate controlled in the weighting.
– Not acceptable if the estimate is a median. Often
a direct median gives a zero standard error.
20
Standard Error Options for
Medians
• Direct median with replicate weights may give
a zero standard error, which understates the
sampling variability of the median.
• Categorical median with replicate weights will
give a more stable standard error, but still
some zero standard errors.
• Design factor method – Start with either the
direct or categorical median, use design
factors for the standard error.
21
Conclusion
• Replicate weights for ACS PUMS are:
– Available for 2005 PUMS and later.
– Easy to use for most estimates.
– Few issues
• For medians
– Replicate weight standard errors may be zeros.
– To avoid the zeros use the design factor method.
22
References
• US Census Bureau: Accuracy of the Data (2006) for ACS is found at:
– http://www.census.gov/acs/www/Downloads/ACS/accuracy2006.pdf
• US Census Bureau: PUMS Accuracy of the Data (2006) is found at:
– http://acsweb2.acs.census.gov/acs/www/Downloads/2006/AccuracyPUMS.pdf
• US Census Bureau: Design and Methodology: American Community Survey,
Technical Paper 67, May 2006,
– http://www.census.gov/acs/www/Downloads/tp67.pdf
• Fay & Train, Aspects of Survey and Model-Based Postcensal Estimation of
Income and Poverty Characteristics for States and Counties, 1995
– http://www.census.gov/hhes/www/saipe/asapaper/FayTrain95.pdf
23
Contact Information
• For questions about this presentation, or for
an example program to generate standard
errors, contact me at
B.Dale.Garrett@census.gov
Views expressed in this paper are those of the
authors and not necessarily those of the U.S.
Census Bureau.
24
How to Derive an Estimate –
Direct Medians
• The direct median is the weighted sample
median or the distributional median.
• Sum the weights for the characteristic total.
• Sort the file on the value of interest.
• Sum the weights until the 50% point.
• The direct median is the value of the record
which crosses the 50% point.
• Or a point between the values of two records
that divide the file into two exact halves.
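The direct-median steps can be sketched as below. One assumption is labeled in the code: when the cumulative weight lands exactly on the 50% point, the sketch returns the midpoint of the two neighboring values, which is only one reasonable reading of "a point between the values of two records". The test data reuse the five-record example from this presentation, with percents standing in for weights.

```python
def direct_median(values, weights):
    """Weighted sample median: sort by value, accumulate weights,
    and return the value of the record that crosses the 50% point."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    cumulative = 0
    for i, (value, w) in enumerate(pairs):
        cumulative += w
        if cumulative > half:
            return value
        if cumulative == half and i + 1 < len(pairs):
            # Assumption: split exact halves at the midpoint of the two values.
            return (value + pairs[i + 1][0]) / 2
    return pairs[-1][0]


incomes = [18000, 33000, 41000, 49000, 62000]
percents = [18, 22, 20, 15, 25]  # percent of total weight per record

median_income = direct_median(incomes, percents)
print(median_income)
```

The cumulative percents are 18, 40, and 60, so the third record is the one that crosses the 50% point.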
25
How to Derive an Estimate –
Categorical Medians
• Categorical or interpolated medians.
– Used for published ACS statistics in Factfinder.
• Categorical medians are interpolations:
– A weighted distribution of the characteristic.
– Each bin or row is assigned a range of values.
– Uses linear interpolation for most variables.
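A minimal sketch of a categorical median using linear interpolation within the bin that contains the 50% point. The bin layout and weights come from the five-record example in this presentation; the function name and the handling of an open-ended bin are assumptions of this sketch, not the published ACS procedure.

```python
def categorical_median(bins):
    """bins: list of (lower, upper, weight) tuples sorted by lower bound.
    upper may be None for an open-ended top bin.
    Interpolates linearly inside the bin that contains the 50% point."""
    half = sum(w for _, _, w in bins) / 2
    cumulative = 0
    for lower, upper, w in bins:
        if w > 0 and cumulative + w >= half:
            if upper is None:
                # Assumption: no interpolation is attempted in an open-ended bin.
                raise ValueError("median falls in an open-ended bin")
            fraction = (half - cumulative) / w
            return lower + fraction * (upper - lower)
        cumulative += w
    raise ValueError("no positive weight in any bin")


# Weighted distribution from the example: records 3 and 4 share one bin.
income_bins = [
    (-59000, 20000, 18),
    (20000, 40000, 22),
    (40000, 60000, 35),
    (60000, None, 25),
]

median_income = categorical_median(income_bins)
print(round(median_income))
```

Here 40 percent of the weight lies below 40,000, so the median sits 10/35 of the way through the 40,000 to 60,000 bin, at about 45,714, which the example slide rounds to 45,700.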
26
Direct Median
Example Based on 5 Records
Record #   Percent of Total   Income from record
1          18%                18,000
2          22%                33,000
3          20%                41,000
4          15%                49,000
5          25%                62,000

Direct median: 41,000
27
Direct and Categorical Medians
Example Based on 5 Records
Income Range         Record #   Percent of Total   Income from record
-59,000 to 20,000    1          18%                18,000
20,000 to 40,000     2          22%                33,000
40,000 to 60,000     3          20%                41,000
                     4          15%                49,000
60,000 +             5          25%                62,000

Direct median: 41,000
Categorical median: 45,700
28