Weighting and imputation
PHC 6716
July 13, 2011
Chris McCarty

Weighting
• Weighting is the process of adjusting the contribution of each observation in a survey sample based on independent knowledge about the appropriate distributions
• Before weighting, the implied weight of each observation is 1.0
• After weighting, some observations will have weights >1.0, some <1.0, and some exactly 1.0
• No observation should have a weight of 0
• Two general types of weighting:
  – Design weights -- Adjusting for differences due to intentional disproportionate sampling (e.g. over-sampling African Americans or certain regions)
  – Post-stratification weights -- Adjusting for differences in population or households when the released sample is intended to be representative (e.g. adjustments for non-response of young people)

Common sources for calculating weights
• U.S. Census
• Current Population Survey
• American Community Survey
• For Florida county, age, race, and ethnicity: the BEBR Population Program

How frequency procedures use weights
• All statistical packages have options on procedures to incorporate weights
• For frequency procedures, the weights are multiplied by the unweighted frequencies, then percentages are calculated on the result

How to make a simple weight

A        B          C           D              E        F           G
Region   Frequency  Percent of  Percent from   Weight   Adjusted    Adjusted
                    sample      other source   (D/C)    frequency   percent
                                                        (B*E)
North    1          10.0        25.0           2.5      2.5         25.0
South    4          40.0        25.0           0.625    2.5         25.0
East     2          20.0        25.0           1.25     2.5         25.0
West     3          30.0        25.0           0.833    2.5         25.0
Total    10         100.00      100.00         -        10          100.00

What that would look like in the data set

Observation  Region  Employed  Weight
1            N       Y         2.5
2            S       N         0.625
3            S       Y         0.625
4            S       Y         0.625
5            S       N         0.625
6            E       N         1.25
7            E       N         1.25
8            W       Y         0.833
9            W       Y         0.833
10           W       N         0.833
Total        -       -         9.99

Original and adjusted frequency of the Employment variable

Employed  Frequency  Percent  Frequency adjusted  Percent adjusted
                              to weights*         to weights
Y         5          50.00    5.416               54.17
N         5          50.00    4.583               45.83
Total     10         100.00   9.99                100.00

*This is the sum of the
weights for the category

Notes on weighting
• Typically you don’t want weights to make enormous differences
• Keep in mind that by weighting you are saying you have information from outside the survey process that tells you the proper distribution
• You could conceivably up-weight results from a small sample stratum
• Weights are typically used for accurate estimates of prevalence
• Models that test relationships do not need weights if they include the variables you would use to weight

Weighting with more than one variable
• Combine weights by multiplication
  – Create an individual weight for each variable, then multiply the weights to get a single weight (W_age * W_gender)
  – Not a good solution with a lot of variables
• Combine weights iteratively
  – Calculate a weight for the first variable using its frequency table
  – Apply that weight in the frequency of the second variable to create a new weight
  – Apply that weight in the frequency of the third variable to create a new weight
  – And so on

Consumer Confidence
• Survey of approximately 500 Florida households each month
• RDD (random digit dialing) landline survey
• Five questions (components) averaged into an Overall Index
• Until now, the only weighting has been post-stratification by proportion of households by county

Potential weighting variables: County
• Typically we get underrepresentation from large south Florida counties (Miami-Dade) and overrepresentation from northern counties (Alachua)
• Household proportions are estimated between census years by BEBR
• Weights June 2011.xls

Potential weighting variables: Age
• RDD tends to lead to oversampling of seniors with landlines
• Cell phones emerged as a problem around 2005
• No reliable age group data until the 2010 Census
• Elderly respondents tend to be less confident than younger respondents due to fixed incomes
• Weights June 2011.xls

Potential weighting variables: Hispanic Ethnicity
• Cell phones tend to be used disproportionately by Hispanics
• The 2010 Census provided reliable data about the proportion of Hispanic Floridians
• Hispanics tend to have lower
confidence than non-Hispanics
• Weights June 2011.xls

Potential weighting variables: Gender
• Cell phones are disproportionately used by young males
• Monthly CCI uses youngest male/oldest female respondent selection
• Weights June 2011.xls

[Figure: Overall Index. x-axis: YM, 2005 to 2011; series: indexus, indexus_cnty, indexus_cnty_a, indexus_cnty_a_h, indexus_cnty_a_h_s]
[Figure: Personal Finances Now Compared to a Year Ago. x-axis: YM, 2005 to 2011; series: icurfin, icurfin_cnty, icurfin_cnty_a, icurfin_cnty_a_h, icurfin_cnty_a_h_s]
[Figure: Personal Finances Expected a Year From Now. x-axis: YM, 2005 to 2011; series: ifutfin, ifutfin_cnty, ifutfin_cnty_a, ifutfin_cnty_a_h, ifutfin_cnty_a_h_s]
[Figure: US Economic Conditions Over Next Year. x-axis: YM, 2005 to 2011; series: iusfufi, iusfufi_cnty, iusfufi_cnty_a, iusfufi_cnty_a_h, iusfufi_cnty_a_h_s]
[Figure: US Economic Conditions Over Next 5 Years. x-axis: YM, 2005 to 2011; series: iusnex5, iusnex5_cnty, iusnex5_cnty_a, iusnex5_cnty_a_h, iusnex5_cnty_a_h_s]
[Figure: Good Time to Buy Big Ticket Items. x-axis: YM, 2005 to 2011; series: igbtime, igbtime_cnty, igbtime_cnty_a, igbtime_cnty_a_h, igbtime_cnty_a_h_s]
[Figure: Overall Index -- Closeup. x-axis: YM, 2005 to 2011; series: indexus, indexus_cnty, indexus_cnty_a, indexus_cnty_a_h, indexus_cnty_a_h_s]
[Figure: Personal Finances Now Compared to a Year Ago - Closeup. series: icurfin, icurfin_cnty, icurfin_cnty_a, icurfin_cnty_a_h, icurfin_cnty_a_h_s]

Example 2 - FHIS
• The state of Florida wanted to estimate rates of the uninsured
• They stratified the state into 17 regions and wanted to be able to make estimates for the state and each region with a tolerable margin of error
• At the state level, they wanted to be able to say something about Blacks, Hispanics, and those under 200 percent of the poverty level
• Data on these demographics for each Florida telephone exchange were obtained prior to sampling
• Strata were created from exchanges
• This made it possible to create weights based on known households in each exchange
• This required design weights to
adjust for disproportionate sampling

Example 3 - Medicaid survey
• The state wanted to evaluate the Medicaid Reform being conducted in Duval and Broward counties
• They wanted to administer a modified CAHPS instrument to adults and children separately
• They wanted to stratify by plan as well, sampling a minimum number of observations per plan
• In the end they wanted to compare plans, counties, and adults versus children
• These weights required knowledge of total enrollment for each of these characteristics (plan, age, county)

Imputation
• Like weighting, imputation involves adjusting the analysis after data collection
• Unlike weighting, imputation is the deliberate creation of data that were not actually collected
• The main reason for imputation is to retain observations in a statistical analysis that would otherwise be left out
• Your ability to discover significant results may be compromised by too many missing values
• In some cases there may be systematic bias associated with missing data, so that not imputing produces an unrepresentative result

Imputation and regressions
• Imputation is particularly common when data are analyzed with regression analysis
• A regression model explains the variability in a dependent variable using one or more independent variables
• Observations can only be included in the regression if they have values for all variables in the model
• Models with a lot of variables increase the probability that an observation will be missing a value for at least one of them

Example Model
Income = β1(Age) + β2(Education) + β3(Employed)

Imputation algorithms
• Two general categories
  – Random imputation assigns values randomly, often based on a desired statistical distribution
  – Deterministic imputation typically assigns values based on existing knowledge
• Existing knowledge could be in the data
  – Single imputation fills missing data with one value, such as the mean of all non-missing values for a continuous variable
  – Multiple imputation fills in
missing data with a set of plausible values
  – Hot deck imputation fills in missing values with those of an observation that matches on key variables
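
As a rough illustration of two of the algorithms listed above, the following Python sketch applies single (mean) imputation and hot deck imputation to a small made-up data set. The records, field names, and function names are hypothetical and are not taken from any survey discussed here. The hot deck version simply copies the value from the first observation that matches on the key variables; this is a simplification chosen so the result is reproducible, and real implementations often pick the donor at random.

```python
from statistics import mean

# Hypothetical records; None marks a missing income value.
records = [
    {"id": 1, "age_group": "18-34", "employed": "Y", "income": 30000},
    {"id": 2, "age_group": "18-34", "employed": "Y", "income": None},
    {"id": 3, "age_group": "35-64", "employed": "Y", "income": 50000},
    {"id": 4, "age_group": "35-64", "employed": "N", "income": 20000},
    {"id": 5, "age_group": "35-64", "employed": "N", "income": None},
    {"id": 6, "age_group": "65+", "employed": "N", "income": None},
]


def mean_impute(rows, field):
    """Single imputation: fill every missing value with the observed mean."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    # Build new dicts so the input rows are left unchanged.
    return [dict(r, **{field: fill if r[field] is None else r[field]}) for r in rows]


def hot_deck_impute(rows, field, keys):
    """Hot deck: copy the value from the first donor that matches on all keys.

    Rows with no matching donor are left missing.
    """
    donors = {}
    for r in rows:
        if r[field] is not None:
            # setdefault keeps the first donor seen for each key combination.
            donors.setdefault(tuple(r[k] for k in keys), r[field])
    result = []
    for r in rows:
        value = r[field]
        if value is None:
            value = donors.get(tuple(r[k] for k in keys))
        result.append(dict(r, **{field: value}))
    return result


mean_filled = mean_impute(records, "income")
hot_deck_filled = hot_deck_impute(records, "income", ["age_group", "employed"])

for row in hot_deck_filled:
    print(row["id"], row["income"])
```

Mean imputation gives all three missing incomes the same value, which keeps the observations in a regression but understates variability. The hot deck version leaves observation 6 missing because no donor shares its age group and employment status, which shows why the choice of matching variables matters.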