Weighting and imputation

PHC 6716
July 13, 2011
Chris McCarty
Weighting
• Weighting is the process of adjusting the contribution of each observation in a survey sample based on independent knowledge about appropriate distributions
• Before weighting, the implied weight of each observation is 1.0
• After weighting, some observations will have weights >1.0, some <1.0, and some exactly 1.0
• No observations should have a weight of 0
• Two general types of weighting:
– Design weights -- Adjusting for differences due to intentional disproportionate sampling (e.g. over-sampling African Americans or certain regions)
– Post-stratification weights -- Adjusting for differences in population or households when release of sample is intended to be representative (e.g. adjustments for non-response of young people)
Common sources for calculating weights
• U.S. Census
• Current Population Survey
• American Community Survey
• For Florida county, age, race, and ethnicity: the BEBR Population Program
How frequency procedures use weights
• All statistical packages have options on
procedures to incorporate weights
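• For frequency procedures, each observation counts as its weight rather than as 1 (when every observation in a category shares one weight, this equals the weight multiplied by the unweighted frequency); percentages are then calculated on the result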
How to make a simple weight

A        B           C            D              E        F           G
Region   Frequency   Percent of   Percent from   Weight   Adjusted    Adjusted
                     sample       other source   (D/C)    frequency   percent
                                                          (B*E)
North    1           10.0         25.0           2.5      2.5         25.0
South    4           40.0         25.0           0.625    2.5         25.0
East     2           20.0         25.0           1.25     2.5         25.0
West     3           30.0         25.0           0.833    2.5         25.0
Total    10          100.00       100.00         -        10          100.00
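The column arithmetic in the table can be reproduced with a short Python sketch. The dictionary and variable names are illustrative choices, not part of the slides; the counts and target percentages are the ones in the table.

```python
# Sample counts by region (column B) and target percentages from an
# outside source (column D), taken from the table above.
sample_counts = {"North": 1, "South": 4, "East": 2, "West": 3}
target_percent = {"North": 25.0, "South": 25.0, "East": 25.0, "West": 25.0}

n = sum(sample_counts.values())

weights = {}
adjusted_freq = {}
for region, count in sample_counts.items():
    sample_percent = 100.0 * count / n                         # column C
    weights[region] = target_percent[region] / sample_percent  # column E = D / C
    adjusted_freq[region] = count * weights[region]            # column F = B * E

total_adjusted = sum(adjusted_freq.values())
adjusted_percent = {                                           # column G
    region: 100.0 * freq / total_adjusted
    for region, freq in adjusted_freq.items()
}

for region in sample_counts:
    print(region,
          round(weights[region], 3),
          round(adjusted_freq[region], 2),
          round(adjusted_percent[region], 1))
```

Because the target percentages sum to 100, the adjusted frequencies sum back to the original sample size of 10.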
What that would look like in data set

Observation   Region   Employed   Weight
1             N        Y          2.5
2             S        N          0.625
3             S        Y          0.625
4             S        Y          0.625
5             S        N          0.625
6             E        N          1.25
7             E        N          1.25
8             W        Y          0.833
9             W        Y          0.833
10            W        N          0.833
Total         -        -          9.99
Original and adjusted frequency of Employment variable

Employed   Frequency   Percent   Frequency adjusted   Percent adjusted
                                 to weights*          to weights
Y          5           50.00     5.416                54.17
N          5           50.00     4.583                45.83
Total      10          100.00    9.99                 100.00

*This is the sum of the weights for the category
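A minimal Python sketch of a weighted frequency, using the ten observations from the data set slide. The function name `weighted_frequency` and the tuple layout are assumptions made for illustration; statistical packages provide this through their own weight options.

```python
from collections import defaultdict

# (region, employed, weight) for the ten observations in the data set slide.
observations = [
    ("N", "Y", 2.5),
    ("S", "N", 0.625), ("S", "Y", 0.625), ("S", "Y", 0.625), ("S", "N", 0.625),
    ("E", "N", 1.25), ("E", "N", 1.25),
    ("W", "Y", 0.833), ("W", "Y", 0.833), ("W", "N", 0.833),
]


def weighted_frequency(rows):
    """Return {category: (sum of weights, percent of total weight)}."""
    totals = defaultdict(float)
    for _region, employed, weight in rows:
        totals[employed] += weight          # each row counts as its weight
    grand_total = sum(totals.values())
    return {k: (v, 100.0 * v / grand_total) for k, v in totals.items()}


result = weighted_frequency(observations)
for category in ("Y", "N"):
    freq, pct = result[category]
    print(f"{category}: weighted frequency {freq:.3f}, weighted percent {pct:.2f}")
```

The percentages are computed on the sum of the weights, so they always total 100 even when the rounded weights do not sum exactly to the sample size.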
Notes on weighting
• Typically you don’t want weights to make enormous
differences
• Keep in mind that with weighting you are saying you have
information extraneous to the survey process that informs
you of the proper distribution
• You could conceivably up-weight results from a stratum with a small sample, so that a few respondents carry a large share of the estimate
• Weights are typically used for accurate estimates of
prevalence
• Models where you test relationships do not need weights if
you include the variables you would use to weight
Weighting with more than one variable
• Combined weight with multiplication
– Create individual weights for each variable, then multiply the weights to get a single weight (W_age * W_gender)
– Not a good solution with a lot of variables
• Combine weights iteratively
– Calculate weight for a variable using frequency table
– Use that weight in frequency of second variable to create
weight
– Use that weight in frequency of third variable to create weight
– And so on
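The iterative approach can be sketched in Python as follows. This is one possible reading of the steps above, not the procedure used for the slides: the respondents, the target proportions, and the function names `rake` and `weighted_share` are all hypothetical, and repeating the pass over the variables until the margins settle (commonly called raking) is an addition the slide does not spell out.

```python
def rake(rows, targets, iterations=25):
    """Adjust weights one variable at a time so weighted shares match targets.

    rows    : list of dicts, one per respondent
    targets : {variable: {category: target proportion}}
    """
    weights = [1.0] * len(rows)
    for _ in range(iterations):
        for variable, shares in targets.items():
            # Weighted frequency of this variable using the current weights.
            totals = {}
            for row, w in zip(rows, weights):
                totals[row[variable]] = totals.get(row[variable], 0.0) + w
            grand_total = sum(totals.values())
            # Adjustment factor = target share / current weighted share.
            factors = {c: shares[c] / (totals[c] / grand_total) for c in totals}
            weights = [w * factors[row[variable]] for row, w in zip(rows, weights)]
    return weights


def weighted_share(rows, weights, variable, category):
    """Share of total weight held by one category of one variable."""
    hit = sum(w for row, w in zip(rows, weights) if row[variable] == category)
    return hit / sum(weights)


# Hypothetical sample of 10 respondents and hypothetical 50/50 targets.
rows = (
    [{"age": "young", "gender": "M"}] * 1
    + [{"age": "young", "gender": "F"}] * 2
    + [{"age": "old", "gender": "M"}] * 3
    + [{"age": "old", "gender": "F"}] * 4
)
targets = {
    "age": {"young": 0.5, "old": 0.5},
    "gender": {"M": 0.5, "F": 0.5},
}

weights = rake(rows, targets)
print(round(weighted_share(rows, weights, "age", "young"), 4))
print(round(weighted_share(rows, weights, "gender", "M"), 4))
```

Unlike multiplying separate weights, each step here uses the weights from the previous step, so adjusting for gender takes the age adjustment into account. The sketch assumes every target category appears at least once in the sample.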
Consumer Confidence
• Survey of approximately 500 Florida households
each month
• RDD Landline Survey
• Five questions (components) averaged into an
Overall Index
• Until now only post-stratification weighting by
proportion of households by county
Potential weighting variables
County
• Typically we get underrepresentation from large south Florida counties (Miami-Dade) and overrepresentation from northern counties (Alachua)
• Household proportions are estimated between census years by BEBR
• Weights June 2011.xls
Potential weighting variables
Age
• RDD tends to lead to oversampling of seniors with landlines
• Cell phones emerged as a problem around 2005
• No reliable age group data until 2010 Census
• Elderly tend to be less confident than younger respondents due to fixed incomes
• Weights June 2011.xls
Potential weighting variables
Hispanic Ethnicity
• Cell phones tend to be used disproportionately by Hispanics
• 2010 Census provided reliable data about proportion of Hispanic Floridians
• Hispanics tend to have lower confidence than non-Hispanics
• Weights June 2011.xls
Potential weighting variables
Gender
• Cell phones are disproportionately used by young males
• Monthly CCI uses youngest male/oldest female respondent selection
• Weights June 2011.xls
[Charts omitted. Each plots five series against YM (year-month), 2005 to 2011-4: a base series and versions with the suffixes _cnty, _cnty_a, _cnty_a_h, and _cnty_a_h_s. Chart titles, base series, and vertical axis ranges:]
• Overall Index (indexus; 0 to 120)
• Personal Finances Now Compared to a Year Ago (icurfin; 0 to 100)
• Personal Finances Expected a Year From Now (ifutfin; 0 to 120)
• US Economic Conditions Over Next Year (iusfufi; 0 to 100)
• US Economic Conditions Over Next 5 Years (iusnex5; 0 to 100)
• Good Time to Buy Big Ticket Items (igbtime; 0 to 140)
• Overall Index -- Closeup (indexus; 55 to 100)
• Personal Finances Now Compared to a Year Ago - Closeup (icurfin; 35 to 105)
Example 2 – FHIS
• The state of Florida wanted to estimate rates of the uninsured
• They stratified the state into 17 regions and wanted to be able to make estimates for the state and each region with a tolerable margin of error
• On the state level they wanted to be able to say something about Blacks, Hispanics and those under 200 percent of the poverty level
• Data on these demographics for each Florida telephone exchange were obtained prior to sampling
• Strata were created from exchanges
• This made it possible to create weights based on known households in each exchange
• This required design weights to adjust for disproportionate sampling
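A hedged sketch of how a design weight can undo disproportionate sampling. The two strata, their household counts, and the survey results are hypothetical numbers invented for illustration, not FHIS figures. The weight used here (known households divided by completed interviews) is a common textbook form of a design weight and is assumed rather than taken from the slides.

```python
# Hypothetical strata: known households, completed interviews, and the
# number of interviewed households found to be uninsured.
strata = {
    "stratum_a": {"households": 50_000, "interviews": 500, "uninsured": 100},
    "stratum_b": {"households": 200_000, "interviews": 500, "uninsured": 50},
}


def design_weight(households, interviews):
    """Each interview stands in for households / interviews households."""
    return households / interviews


weighted_uninsured = 0.0
weighted_total = 0.0
for name, s in strata.items():
    w = design_weight(s["households"], s["interviews"])
    weighted_uninsured += w * s["uninsured"]
    weighted_total += w * s["interviews"]

unweighted_rate = (
    sum(s["uninsured"] for s in strata.values())
    / sum(s["interviews"] for s in strata.values())
)
weighted_rate = weighted_uninsured / weighted_total

print(f"unweighted rate: {unweighted_rate:.3f}")
print(f"weighted rate:   {weighted_rate:.3f}")
```

In this made-up case the smaller stratum was sampled at four times the rate of the larger one, so the unweighted rate leans toward the smaller stratum's higher uninsured rate. The design weights restore each stratum's actual share of households.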
Example 3 – Medicaid survey
• The state wanted to evaluate Medicaid Reform being conducted in Duval
and Broward counties
• They wanted to administer a modified CAHPS instrument to Adults and
Children separately
• They wanted to stratify by plan as well, sampling a minimum number of
observations per plan
• In the end they wanted to compare plans, counties and adults and
children
• These weights required knowledge about total enrollment for each one of
these characteristics (plan, age, county)
Imputation
• Like weighting, imputation involves adjusting the analysis after data
collection
• Unlike weighting, imputation is the deliberate creation of data that
were not actually collected
• The main reason for imputation is to retain observations in a
statistical analysis that would otherwise be left out
• Your ability to discover significant results may be compromised by
too many missing values
• In some cases there may be systematic bias associated with missing data, so that not imputing produces an unrepresentative result
Imputation and regressions
• Imputation is particularly common when data are analyzed
with regression analysis
• A regression model explains the variability in a dependent
variable using one or more independent variables
• Observations can only be included in the regression if they
have values for all variables in the model
• Models with a lot of variables increase the probability that an
observation will have at least one missing value for them
Example Model
Income = β1(Age) + β2(Education) + β3(Employed)
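To see how missing values shrink the sample available to a model like this one, here is a small Python sketch. The five respondents are hypothetical, and `complete_cases` and `prob_any_missing` are illustrative names. The probability formula assumes each variable is missing independently at the same rate, which real surveys rarely satisfy exactly.

```python
# Hypothetical respondents; None marks a missing answer.
respondents = [
    {"income": 42000, "age": 35, "education": 16, "employed": 1},
    {"income": None, "age": 52, "education": 12, "employed": 1},
    {"income": 31000, "age": None, "education": 12, "employed": 0},
    {"income": 58000, "age": 44, "education": 18, "employed": 1},
    {"income": 27000, "age": 23, "education": None, "employed": 1},
]
model_variables = ["income", "age", "education", "employed"]


def complete_cases(rows, variables):
    """Keep only rows with a value for every variable in the model."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def prob_any_missing(p, k):
    """Chance of at least one missing value among k variables, assuming
    each is missing independently with probability p."""
    return 1 - (1 - p) ** k


usable = complete_cases(respondents, model_variables)
print(len(usable), "of", len(respondents), "observations usable")

for k in (3, 10, 20):
    print(k, "variables:", round(prob_any_missing(0.05, k), 3))
```

Each respondent here is missing at most one answer, yet the regression would run on fewer than half of them.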
Imputation algorithms
• Two general categories
– Random imputation assigns values randomly, often based
on a desired statistical distribution
– Deterministic imputation typically assigns values based on
existing knowledge
• Existing knowledge could be in the data
– Single imputation fills missing data with one value, such as the mean
of all non-missing values for a continuous variable
– Multiple imputation fills in missing data with a set of plausible values
– Hot deck imputation fills in missing values with those of an
observation that matches on key variables
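Two of the approaches listed above, sketched in Python under simplifying assumptions: single imputation with the mean, and a basic hot deck that borrows a value from a randomly chosen donor matching on key variables. The function names, the sample rows, and the choice to leave a value missing when no donor matches are all illustrative decisions, not taken from the slides.

```python
import random
from statistics import mean


def mean_impute(values):
    """Single imputation: replace each missing value with the observed mean."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def hot_deck_impute(rows, target, keys, rng=None):
    """Fill a missing `target` from a random donor that matches on `keys`.

    Rows with no matching donor are left missing.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    filled = []
    for row in rows:
        new_row = dict(row)
        if row[target] is None:
            donors = [
                d for d in rows
                if d[target] is not None and all(d[k] == row[k] for k in keys)
            ]
            if donors:
                new_row[target] = rng.choice(donors)[target]
        filled.append(new_row)
    return filled


# Hypothetical data.
incomes = [10, None, 30]
print(mean_impute(incomes))

rows = [
    {"region": "N", "employed": "Y", "income": 40000},
    {"region": "N", "employed": "Y", "income": None},
    {"region": "S", "employed": "N", "income": 20000},
]
filled = hot_deck_impute(rows, target="income", keys=["region", "employed"])
print(filled[1]["income"])
```

Mean imputation keeps the observation but shrinks the variable's variability, while the hot deck reuses values that actually occurred among similar respondents.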