PALS Weight Documentation
Junia Howell and Rose Medeiros
March 15, 2013

This document gives a detailed explanation of how the wave 2 longitudinal weights were created. For breadth of explanation, we describe all of the weight variables in waves 1 and 2 and note when each is appropriate to use in research.

Weight Variables

Wave 1
pawt—Analysis weight—person level (population scaled)1
PAWT2—Weight adjustment (sample scaled)

Wave 2
persweight1—Weight adjustment
hhweight1—Approximate wave 1 household weight
baseweight2—Quasi-base weight for wave 2
psweight2—Post-stratified weight (population scaled)
nweight2—Post-stratified weight (sample scaled)
psweight_w2rake—Raked post-stratified weight (population scaled)
nweight_w2rake—Raked post-stratified weight (sample scaled)

Wave 2 Longitudinal Weights
pre_retent—Retention weight
ploweight_w2rake—Post-stratified longitudinal weight (population scaled)
loweight_w2rake—Post-stratified longitudinal weight (sample scaled)
psu_id—Primary sampling unit ID

1 This is a sampling weight, which in Stata is called a p-weight. The RTI International survey firm labeled it an analysis weight. For clarity and consistency we kept this label, but note that it is actually a sampling weight.

Suggested Use of Weights

Wave 1—When analyzing ONLY wave 1
pawt—For counts of the population. For example, calculating the number of people in the US population who believe in the existence of God.
PAWT2—For regressions or proportions. Since this weight is scaled to the sample, the weights are smaller and will introduce less variability into models.

Wave 2—When analyzing ONLY wave 2
psweight_w2rake—For counts of the population. For example, calculating the number of people in the US population who believe in the existence of God.
nweight_w2rake—For regressions or proportions. Since this weight is scaled to the sample, the weights are smaller and will introduce less variability into models.
Wave 2 Longitudinal Weights—When analyzing both wave 1 and wave 2
ploweight_w2rake—For counts of the population. For example, calculating the number of people in the US population who changed from not believing in the existence of God in 2006 to believing in the existence of God in 2012.
loweight_w2rake—For regressions or proportions. Since this weight is scaled to the sample, the weights are smaller and will introduce less variability into models.

Sample Stata Code for Weights

To set the weights in Stata, the suggested settings are:

svyset psu_id [pweight=pawt]
svyset psu_id [pweight=PAWT2]
svyset psu_id [pweight=psweight_w2rake]
svyset psu_id [pweight=nweight_w2rake]
svyset psu_id [pweight=ploweight_w2rake]
svyset psu_id [pweight=loweight_w2rake]

Commands can then be run with the svy prefix. If you are using subsets of the data, use the subpop() option instead of the "if" qualifier:

svy : tabulate re_1a
svy , subpop(female) : regress re_17 re_1a

Explanation of Weight Variables

Wave 1

pawt—Analysis weight—person level (population scaled)
This variable was created for the wave 1 data by the RTI International survey firm. Because participants were selected from specific zip codes, the weight was first adjusted to control for this geography-based selection. The weights were then post-stratified by gender (2 categories), race/ethnicity (4 categories), and age (3 categories). Since the 2006 American Community Survey (ACS) was not yet released, 2005 ACS totals were used. The total population 18 and over was 215,246,449, which is what this weight sums to. RTI used the cross tabulations of all of these categories (i.e., the weights were adjusted within cells); in other words, raking was not used to create this weight. For more information regarding the creation of this weight, see 3_PALS Sample Weighting-actual RTIreport.pdf. For the population totals used, see PALS Weight Summary 06FEB2007.pdf.
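The within-cell adjustment described above can be sketched in a few lines of Python. This is an illustration of the technique only, not the RTI code; the cell labels and population totals are hypothetical.

```python
# Cell-based post-stratification: each respondent's weight is multiplied by
# (population cell total) / (weighted sample cell total) within their
# gender x race x age cell. All names and numbers here are hypothetical.
from collections import defaultdict

def poststratify(records, pop_totals):
    """records: list of dicts with a 'cell' key and a 'weight' key.
    pop_totals: dict mapping each cell to its population total."""
    weighted = defaultdict(float)
    for r in records:
        weighted[r["cell"]] += r["weight"]
    return [
        {**r, "weight": r["weight"] * pop_totals[r["cell"]] / weighted[r["cell"]]}
        for r in records
    ]

# Hypothetical toy sample: two cells with target totals 100 and 50.
sample = [{"cell": "F-White-18-29", "weight": 20.0},
          {"cell": "F-White-18-29", "weight": 30.0},
          {"cell": "M-Black-30-49", "weight": 25.0}]
adjusted = poststratify(sample, {"F-White-18-29": 100.0, "M-Black-30-49": 50.0})
# Within each cell the adjusted weights now sum to the population total.
```

After the adjustment, summing the weights within any cell reproduces that cell's population total, which is why the full weight sums to 215,246,449.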
PAWT2—Weight adjustment (sample scaled)
This weight was also created by the RTI International survey firm. It adjusts the pawt weight so that it sums to the sample size instead of the population total: the PAWT2 weights sum to 2,610, whereas the pawt weights sum to 215,246,449.

Wave 2

persweight1—Weight adjustment
Created by the Abt SRBI survey firm, this is the PAWT2 weight from wave 1 with the 101 new respondents (young adults who became eligible since 2006) added in wave 2. Each of the 101 new respondents was given the same person weight as the person in their household who answered the survey in wave 1.

hhweight1—Approximate wave 1 household weight
Created by the Abt SRBI survey firm, this approximate household weight was created by dividing the person weight (persweight1) by household size, as given in wave 1 and recorded in the hr_1 variable, and rounding to the nearest tenth. In this way, the person's weight is assumed to apply to the entire household but is then divided up among the household members.

baseweight2—Quasi-base weight for wave 2
Created by the Abt SRBI survey firm, this weight is not mentioned in any of the documentation. It is unclear what exactly the weight is or how it was used.

psweight2—Post-stratified weight (population scaled)
Created by the Abt SRBI survey firm, this is a post-stratified weight that uses the household weight from wave 1 as a base. As in wave 1, they stratified the sample by race, gender, and age, using the ACS 2010. They state:

"The following post-stratification cells were constructed: race/ethnicity by gender for minority groups (Black, Hispanic, Asian); gender by three age groups (18-29, 30-49, 50+) for non-Hispanic White and other/mixed races, for a total of 12 post-stratification cells. The smallest groups were NHWO males under 30 (unweighted count of 41) and Asian males (unweighted count of 52).
The largest group was NHWO females over 50 (unweighted count of 229)."

nweight2—Post-stratified weight (sample scaled)
Created by the Abt SRBI survey firm, this weight is scaled to the sample size: psweight2 was multiplied by the quotient of the sample size and the total population.

psweight_w2rake—Raked post-stratified weight (population scaled)
In order for the cross-sectional weights to correspond with the longitudinal weights, we decided to weight wave 2 using the same post-stratification method—raking—chosen for the longitudinal weights. Following the lead of Abt SRBI, we used the estimated household weight, hhweight1, as the base for the new cross-sectional weight. However, for this weight we raked to the 2011 (instead of the 2010) ACS totals. For more information on raking, see the explanation below in the description of the loweight_w2rake weight. The totals used for the raking process included only the population 18 years old and older and were as follows:

Race                                                   Totals
  White                                           157,901,961
  Black                                            27,947,198
  Hispanic                                         34,532,913
  Asian                                            11,664,324
  Other (multiracial, American Indian, etc.)2       5,634,822

Gender                                                 Totals
  Male                                            115,448,178
  Female                                          122,233,040

Age                                                    Totals
  18-29                                            52,298,454
  30-49                                            83,421,502
  50+                                             101,961,262

There were three cases where the person who answered in wave 1 does not seem to be the same person who answered in wave 2. While we exclude these three cases from the longitudinal weight and analysis, we include two of them as new respondents in wave 2. We gave them new IDs and flagged them with the variable flag_w2new so researchers can decide whether to include them in their analysis. In the weighting process, their age and gender as reported in wave 2 were used.

2 This "other" category was calculated by subtracting the White, Black, Hispanic, and Asian totals from the total population over 18 years of age.
Since they were not re-asked their race, we used the race of the wave 1 respondent, who, most likely being a family member, had the same race as the new respondent. For the third case, it appears that a new respondent answered for both themselves and their father, who had been the respondent in wave 1. This case was therefore dropped from the raking process and not given a cross-sectional weight. This respondent is flagged via the wave2_use variable, and we suggest the case not be used in any analysis. For more information on how we made these decisions, see the do-file "Cleaning Wave 1 and 2 Respondents Based on Age."

For respondents who seem to be the same person but gave a different birth year in wave 2 than they had in wave 1, or who refused to answer in wave 2, we used their wave 1 birth year to create the weight. Since all wave 1 interviews were conducted in person, this seems to be the more reliable data point. There were seven respondents who did not give their age in wave 1 but gave their age in wave 2. Again, these were checked against key variables to see if they seemed to be the same person. After concluding that they were, six of their ages from wave 2 were incorporated into wave 1 and used for both the cross-sectional and longitudinal weights. The seventh birth year would have meant the respondent was 12 in wave 1. Since no one under 18 was included and the interview was in person, we coded this as a typo and imputed a birth year using the age from a randomly selected case in the dataset (i.e., unconditional hotdeck imputation). For more detailed documentation of this process, see the do-file "Cleaning Wave 1 and 2 Respondents Based on Age."

nweight_w2rake—Raked post-stratified weight (sample scaled)
This weight scales psweight_w2rake to the sample size. In other words, psweight_w2rake was multiplied by a scaling factor equal to the sample size divided by the total population:
nweight_w2rake = (Sample Size / Total Population) × psweight_w2rake

Note that psweight_w2rake and nweight_w2rake have considerably less variability (less than half the variance) than their un-raked counterparts, psweight2 and nweight2, provided by the Abt SRBI survey firm. Therefore, we believe the raked weights may be more appropriate for regression models.

Wave 2 Longitudinal Weights

pre_retent—Retention weight
Noting the importance of the geographic adjustment in the wave 1 weight for accounting for the survey design, we wanted to start with the original person weight, pawt. At the same time, we wanted to take into consideration that some individuals were more likely to continue with the survey than others. While we could re-weight those who stayed in the survey by characteristics like race, gender, and age using population data, we also wanted to control for the fact that wave 1 variables central to the survey questions, like faith, might also be correlated with attrition. Thus, we created an attrition model that predicted the likelihood of respondents remaining in the sample in wave 2.

To create this attrition model, we first flagged all the variables that were asked of all respondents in wave 1; in other words, we excluded all variables that were part of skip patterns or asked only of certain respondents.3 We then evaluated the correlation between each variable and attrition. Using the pseudo R2 from a logistic regression model, we selected the 30 variables that were most correlated with attrition. We then examined these variables to decide which had high R2 values and would give us a theoretically robust model. We also considered which variables had the fewest missing cases, to minimize the need for imputing missing data.
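For reference, the screening statistic used here is McFadden's pseudo R2, which compares each candidate model's log-likelihood to that of the intercept-only (null) model. The sketch below illustrates the ranking step; the variable names and log-likelihood values are hypothetical, not PALS results.

```python
# McFadden's pseudo R-squared: 1 - (model log-likelihood / null log-likelihood).
# Candidates are ranked by it to pick the variables most predictive of attrition.

def mcfadden_r2(ll_model, ll_null):
    return 1.0 - ll_model / ll_null

# Hypothetical fit results: one single-predictor logit model per candidate.
ll_null = -1800.0
candidates = {"ra_1": -1710.0, "dm_2": -1755.0, "po_4": -1738.0}
ranked = sorted(candidates,
                key=lambda v: mcfadden_r2(candidates[v], ll_null),
                reverse=True)
# ranked[0] is the candidate variable with the highest pseudo R-squared.
```

In practice each candidate's log-likelihood would come from fitting a logistic regression of wave 2 response on that variable; only the ranking step is shown here.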
The final model included 14 variables.4 Using these variables, we then predicted the likelihood that each person would complete wave 2.5 We then divided one by this predicted probability to get pre_retent, the retention weight. The effect of dividing 1 by the probability of response is that people who had a high probability of participating in both waves receive a lower weight, while people with a low probability of participating in both waves receive a higher weight. Intuitively, wave 1 respondents with a low probability of participating in wave 2 are underrepresented in the wave 2 dataset, because fewer people with similar wave 1 responses participated in wave 2. Giving cases with a high probability of nonresponse a higher weight corrects for that imbalance. For example, if a respondent had a probability of responding in wave 2 of .5, that person would receive a weight of 2 (1/.5 = 2). Someone with an even lower probability of response, say .25, would receive a weight of 4. If the model for probability of response at wave 2 does a good job of predicting nonresponse, the weighted sample should more closely resemble the wave 2 sample that would have been collected if all respondents had been reinterviewed, compared to the unweighted data.

pre_retent = 1 / (predicted probability of responding in wave 2)

The retention weight (pre_retent) was then multiplied by the population weight from wave 1 (pawt).

3 Variables regarding household rosters, introductory consent variables, etc. were also dropped. The final list of variables tested for their correlation with attrition included: r_1, hc_3, sa_1, sa_4, sa_7, sa_10, sa_13, sa_19, sa_22, sa_25, sa_28, sa_31, ci_0, ne_1, hm_23_fe, hm_23_in, dm_1_yea, lv_1, lv_2, re_13, re_15, re_16, re_17, hc_1, hc_5, hc_6, hc_19, hc_21, se_1, se_2, se_3, se_4, se_5, se_6, se_7, ci_1, ca_1, rc_1, rc_2, rc_4, ic_1, ic_2, ic_3, ic_4, ic_6, ic_7, ic_8, ic_9, ic_10, ic_15, po_1, po_4, po_6, po_10, vo_1, ma_1, ma_2, ma_3, ma_4, ma_5, ma_6, ma_7, ma_8, ma_9, ma_10, ma_11, rm_1, rm_2, ra_1, ra_2, ra_3, ra_5, ra_5a, ra_6, ra_7, ra_8, ra_9, ra_11, ra_12, ra_15, ra_18, ra_19, ac_5, ac_6, adv_1, adv_2, adv_3, are_5, are_6, ama_1, ama_2, ama_3, ama_4, ama_5, ama_6, ama_7, ama_8, ama_10, ama_11, arbl_1, arbl_3, arbl_9, arbl_11, arbl_12, arbl_13, art_5, art_7, sp_1, sp_2, hm_1, hm_2, hm_3, hm_4, hm_5, hm_6, hm_7, hm_8, hm_9, hm_10, hm_11, hm_12, hm_15, hm_16, hm_17, hm_20, hm_22, hm_24, hm_25, hm_26, hm_27, hm_28, hm_29, hm_30, hm_32, hm_33, hm_35, hm_36, hm_37, hm_38, dm_1a, dm_1b, dm_1k, dm_1l, dm_2, dm_5, cl_2, cl_4, cl_9, cl_14, cl_20, iq_1, iq_4, iq_5, iq_6, iq_7, resp_rse, re_1a.

4 The variables included in the model were as follows: age (dm_1_yea), gender (resp_rse), race (re_1a), education (dm_2), email frequency (sa_3), frequency of discussion with a person who has a professional degree (ic_3), voted in the last election (po_4), beliefs regarding abortion (ma_9), language the interview was conducted in (iq_4), frequency of light exercise (hm_15), whether the home was rented or owned (hc_5), the interviewer's rating of cooperativeness (iq_1), belief in God (ra_1), and whether the respondent provided an additional phone number for future contact (cl_9). Since there were only 6 individuals whose highest degree was a religious master's (and, interestingly, none of these individuals completed wave 2), we combined this category with non-religious master's degrees. All other variables were left as is for the model.

5 There were 95 cases that were missing a value on at least one of these 14 variables. To ensure these cases were not dropped (since we needed them to remain in the data to be assigned a weight), we randomly assigned values for the missing variables, with each possible value equally likely to be chosen. We ran 30 different imputations and took the mean value. The seven cases missing age produced the largest variation, because age had the widest range of possible values of any of the variables.
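The retention-weight arithmetic described above can be sketched directly; the response probabilities and pawt value below are hypothetical.

```python
# pre_retent is the inverse of the predicted probability of responding in
# wave 2; it is then multiplied by the wave 1 weight pawt. All numbers here
# are hypothetical.

def retention_weight(p_response):
    # Inverse of the predicted probability of responding in wave 2.
    return 1.0 / p_response

# A respondent with a .5 response probability gets weight 2; .25 gets 4.
assert retention_weight(0.5) == 2.0
assert retention_weight(0.25) == 4.0

pawt = 80_000.0  # hypothetical wave 1 person weight
longitudinal_base = retention_weight(0.4) * pawt  # 2.5 * 80,000 = 200,000
```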
This way, the original geographic elements of the weights are maintained in the new longitudinal weight.

loweight_w2rake—Post-stratified longitudinal weight (sample scaled)
Since wave 2 did not retain all respondents from wave 1, the weights, even adjusted for non-response at wave 2, could not plausibly be assumed to produce estimates representative of the desired population. Hence, we once again post-stratified the weights based on key demographic variables (age, race/ethnicity, and gender). That is, we took the original weight multiplied by the retention weight for just the people who continued from wave 1 to wave 2 and post-stratified their weights to be representative of the race, gender, and age proportions of the 2006 United States population. To avoid having to collapse the smaller categories, we elected to use a popular technique called "raking."6 Raking adjusts the weights to match the population proportions through an iterative process (i.e., it performs the same set of steps multiple times). Unlike the cell adjustments used for the previous weights, which divide the sample (and population) into all possible combinations of the variables, raking uses the unconditional frequency of each variable. For example, raking might begin by adjusting the weights so that they match the race category population totals, then make the same adjustment for gender, then do the same for age, and then return to race, readjusting the weights to match the race proportions. This process continues until the weights converge, meaning the weights no longer change from iteration to iteration.

6 Battaglia, Michael P., David Izrael, David C. Hoaglin, and Martin R. Frankel. "Tips and Tricks for Raking Survey Data (a.k.a. Sample Balancing)." American Association for Public Opinion Research. http://www.amstat.org/sections/srms/Proceedings/y2004/files/Jsm2004-000074.pdf
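To make the iterative process concrete, here is a minimal raking (iterative proportional fitting) sketch in Python. It illustrates the technique only; it is not the ipfweight program, and the toy records and margin totals are hypothetical.

```python
# Raking: adjust weights to match each variable's marginal totals in turn,
# cycling until the weights stop changing. Toy data; not the PALS code.

def rake(records, margins, tol=1e-10, max_iter=1000):
    """records: list of dicts with one key per raking variable plus 'weight'.
    margins: {variable: {category: population total}}."""
    weights = [r["weight"] for r in records]
    for _ in range(max_iter):
        old = list(weights)
        for var, totals in margins.items():
            # Current weighted total in each category of this variable.
            cur = {c: 0.0 for c in totals}
            for r, w in zip(records, weights):
                cur[r[var]] += w
            # Scale every weight so this variable's margins match exactly.
            for i, r in enumerate(records):
                weights[i] *= totals[r[var]] / cur[r[var]]
        if max(abs(a - b) for a, b in zip(old, weights)) < tol:
            break  # converged: weights no longer change between iterations
    return weights

# Hypothetical toy sample with two raking variables.
toy = [{"race": "White", "gender": "M", "weight": 1.0},
       {"race": "White", "gender": "F", "weight": 1.0},
       {"race": "Black", "gender": "F", "weight": 1.0}]
w = rake(toy, {"race": {"White": 60.0, "Black": 40.0},
               "gender": {"M": 45.0, "F": 55.0}})
# After convergence every margin matches: e.g., the White weights sum to 60.
```

Note the contrast with the cell adjustment: raking never needs a population total for every race-by-gender combination, only the separate race and gender margins, which is why small categories need not be collapsed.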
We elected to use the Stata program ipfweight, created by Michael Bergmann, to perform the raking on our weight variable. We ran 1,000 iterations and checked that the estimated proportions of race, gender, and age converged to the population proportions.7 The totals used for the raking process included only the population 18 years old and older and were as follows:

Race                                                   Totals
  White                                           155,927,239
  Black                                            25,722,778
  Hispanic                                         29,296,686
  Asian                                            10,084,467
  Other (multiracial, American Indian, etc.)8       4,602,172

Gender                                                 Totals
  Male                                            109,685,985
  Female                                          115,947,357

Age                                                    Totals
  18-29                                            50,033,230
  30-49                                            86,303,561
  50+                                              89,296,551

ploweight_w2rake—Post-stratified longitudinal weight (population scaled)
To have a weight that totals to the population, we multiplied the loweight_w2rake weight by the quotient of the total population and the sample size. In our case, this means we multiplied the loweight_w2rake weight by 225,633,342/1,316, or 171,323.72.

ploweight_w2rake = (Total Population / Sample Size) × loweight_w2rake
ploweight_w2rake = (225,633,342 / 1,316) × loweight_w2rake

7 We also tried the Stata program ipfraking, created by Stas Kolenikov. However, the program would end after 6 iterations despite being considerably off from the control populations. This concerned us, since we wanted to ensure our weights were as reliable as possible. The weights also seemed to have several extreme values, which made us cautious. Thus, we also conducted a post-stratification cell adjustment in which we used the same categories as in our raking. This produced very comparable, though slightly more varied, weights. For full documentation of the cell adjustments, or to recreate those weights, see the do-file "Raking Longitudinal and Cross Sectional Weights," section 5, "Post-Stratification Cell Adjustment." After running several tests with different configurations, we concluded that raking with the ipfweight program was the most reliable.
8 This "other" category was calculated by subtracting the White, Black, Hispanic, and Asian totals from the total population over 18 years of age.

Additional Documentation

attrition.do
This file was used to examine the variables' relationships with a person's likelihood of attriting. Notes at the bottom of the file list the 30 variables most predictive of attrition and their corresponding pseudo R2 values.

Cleaning Wave 1 and 2 Respondents Based on Age
This do-file tracks the changes made to both the wave 1 and wave 2 data regarding cases that do not seem to be the same person, and changes to a person's birth year when it was provided in wave 2 but missing from wave 1. It contains all of the code used as well as several comments about the reasoning behind the decisions made.

Creating Data Set for Attrition Model
This do-file creates the variable attrition_w2 to mark those in the sample who responded in both wave 1 and wave 2. It also examines all questions asked in wave 1, marking those asked of everyone and whether these variables are continuous or categorical. This was done to set up a data set for the attrition model.

imputing
This do-file imputes the necessary missing data and estimates the agreed-upon attrition model.

Raking Longitudinal and Cross Sectional Weights
This do-file consists of all the programs and notes surrounding the creation of the actual weights, including the raking process. It also contains the code for the post-stratification cell adjustment of the longitudinal weights for interested users.

Where to Find the Weights

Public Wave 1: pawt; PAWT2
Public Wave 2: psweight_w2rake; nweight_w2rake
Public Merged Wave 1 and Wave 2: ploweight_w2rake; loweight_w2rake
Restricted Wave 1: pre_retent
Restricted Wave 2: persweight1; hhweight1; baseweight2; psweight2; nweight2