PALS Weight Documentation

Junia Howell and Rose Medeiros
March 15, 2013
This document was created to give a detailed explanation of how the wave 2 longitudinal weights were created. For breadth of explanation, we describe all weight variables in waves 1 and 2 and when each is appropriate to use in research.
Weight Variables
Wave 1
pawt—Analysis Weight—Person Level (population scaled)1
PAWT2—Weight adjustment (sample scaled)
Wave 2
persweight1—weight adjustment
hhweight1—Approximate Wave 1 HH weight
baseweight2—Quasi-base weight for wave 2
psweight2—Post-stratified weight (population scaled)
nweight2—Post-stratified weight (sample scaled)
psweight_w2rake—Raked Post-stratified weight (population scaled)
nweight_w2rake—Raked Post-stratified weight (sample scaled)
Wave 2 Longitudinal Weights
pre_retent—Retention Weight
ploweight_w2rake—Post-stratified Longitudinal weight (population scaled)
loweight_w2rake—Post-stratified Longitudinal weight (sample scaled)
psu_id—Primary Sampling Unit ID
1 This is a sampling weight, which in Stata is called a p-weight. The RTI International survey firm labeled it an analysis weight. For clarity and consistency we kept this label, but note that it is actually a sampling weight.
Suggested Use of Weights
Wave 1—When analyzing ONLY wave 1
pawt—For counts of the population. For example, calculating the number of people in the US population that believe in the existence of God.
PAWT2—For regressions or proportions. Since this weight is scaled to the sample, the weights are smaller and will introduce less variability into models.
Wave 2—When analyzing ONLY wave 2
psweight_w2rake—For counts of the population. For example, calculating the number of people in the US population that believe in the existence of God.
nweight_w2rake—For regressions or proportions. Since this weight is scaled to the sample, the weights are smaller and will introduce less variability into models.
Wave 2 Longitudinal Weights—When analyzing both wave 1 and wave 2
ploweight_w2rake—For counts of the population. For example, calculating the number of people in the US population that changed from not believing in the existence of God in 2006 to believing in the existence of God in 2012.
loweight_w2rake—For regressions or proportions. Since this weight is scaled to the sample, the weights are smaller and will introduce less variability into models.
Sample Stata Code for Weights
To set the weights in Stata, the suggested settings are as follows (use the one line matching the weight for your analysis):
svyset psu_id [pweight=pawt]              // wave 1 only, population scaled
svyset psu_id [pweight=PAWT2]             // wave 1 only, sample scaled
svyset psu_id [pweight=psweight_w2rake]   // wave 2 only, population scaled
svyset psu_id [pweight=nweight_w2rake]    // wave 2 only, sample scaled
svyset psu_id [pweight=ploweight_w2rake]  // longitudinal, population scaled
svyset psu_id [pweight=loweight_w2rake]   // longitudinal, sample scaled
Commands can then be run with the svy prefix. If you are using subsets of the data, use the subpop() option instead of the "if" qualifier.
svy : tabulate re_1a
svy , subpop(female) : regress re_17 re_1a
Explanation of Weight Variables
Wave 1
pawt—Analysis Weight—Person Level (population scaled)
This variable was created for the wave 1 data by the RTI International survey firm. Given that participants were selected out of specific zip codes, the weight is first adjusted to control for this geography-based selection. The weights were then post-stratified by gender (2 categories), race/ethnicity (4 categories), and age (3 categories). Since the 2006 American Community Survey (ACS) was not yet released, 2005 ACS totals were used. The total population 18 and over was 215,246,449—which is what this weight sums to. They used the cross tabs of all these categories (i.e., the weights were adjusted within cells); in other words, raking was not used to create this weight. For more information regarding the creation of this weight see 3_PALS Sample Weighting-actual RTIreport.pdf. For the population totals used see PALS Weight Summary 06FEB2007.pdf.
PAWT2—Weight adjustment (sample scaled)
This weight was also created by the RTI International survey firm. It adjusts the pawt weight so that it sums to the sample size instead of the population total. Thus the PAWT2 weights sum to 2,610, whereas the pawt weights sum to 215,246,449.
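The rescaling relationship can be sketched numerically. In this minimal Python illustration, only the two totals (2,610 and 215,246,449) come from the text; the individual weight values are invented:

```python
# Hypothetical illustration: rescaling a population-scaled weight (like
# pawt) to a sample-scaled weight (like PAWT2). Each weight is multiplied
# by (sample size / population total), so the rescaled weights sum to the
# sample size rather than the population. Weight values below are made up.
pop_total = 215_246_449          # wave 1 population 18 and over
n = 2_610                        # wave 1 sample size
pawt = [80_000.0, 120_000.0, 95_000.0]   # toy population-scaled weights
pawt2 = [w * n / pop_total for w in pawt]  # toy sample-scaled weights
```

Applied to all 2,610 cases, the rescaled weights would sum to the sample size while preserving each case's relative weight.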
Wave 2
persweight1—weight adjustment
Created by the Abt SRBI survey firm, this is the PAWT2 weight from wave 1 with the addition of the 101 new respondents (young adults who became eligible since 2006) added in wave 2. The 101 new respondents were given the same person weight as the person in wave 1 who answered the survey from their household.
hhweight1—Approximate Wave 1 HH weight
Created by the Abt SRBI survey firm, this approximate household weight was created by dividing the person weight (persweight1) by household size, as given in wave 1 and recorded in the hr_1 variable, and rounding to the nearest tenth. In this way, the person's weight is assumed to apply to the entire household but is then divided up amongst the household members.
baseweight2—Quasi-base weight for wave 2
Created by the Abt SRBI survey firm this weight is not mentioned in any of the
documentation. It is unclear what exactly the weight is or how it was used.
psweight2—Post-stratified weight (population scaled)
Created by the Abt SRBI survey firm this is a post-stratified weight that uses the
household weight from wave 1 as a base. Like wave 1, they stratified the sample by race,
gender and age using the ACS 2010. They state…
“The following post-stratification cells were constructed: race/ethnicity by gender for
minority groups (Black, Hispanic, Asian); gender by three age groups (18-29, 30-49,
50+) for non-Hispanic White and other/mixed races, for a total of 12 post-stratification
cells. The smallest groups were NHWO males under 30 (unweighted count of 41) and
Asian males (unweighted count of 52). The largest group was NHWO females over 50
(unweighted count of 229).”
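A within-cell post-stratification adjustment of this kind can be sketched as follows. This is a hypothetical Python illustration, not the firm's actual procedure; the cell labels, weights, and targets are invented:

```python
# Hypothetical sketch of within-cell post-stratification: inside each
# cell (e.g., race/ethnicity by gender by age), every weight is scaled
# by the same factor so that the weighted cell total matches the
# population total for that cell. All data below are made up.
def poststratify(weights, cells, cell_targets):
    # current weighted total per cell
    totals = {}
    for w, c in zip(weights, cells):
        totals[c] = totals.get(c, 0.0) + w
    # scale each weight by (population total / current total) of its cell
    return [w * cell_targets[c] / totals[c] for w, c in zip(weights, cells)]

cells = ["black_male", "black_male", "black_female"]   # invented cells
w = poststratify([1.0, 3.0, 2.0], cells,
                 {"black_male": 100, "black_female": 50})
# weighted totals per cell now match the cell targets exactly
```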
nweight2—Post-stratified weight (sample scaled)
Created by the Abt SRBI survey firm, this weight is scaled to the sample size. Thus, psweight2 was multiplied by the quotient of the sample size and the total population.
psweight_w2rake—Raked Post-stratified weight (population scaled)
In order for the cross-sectional weights to correspond with the longitudinal weights, we decided to weight wave 2 using the same post-stratification method—raking—chosen for the longitudinal weights. Following the lead of Abt SRBI, we used the estimated household weight, hhweight1, as the base for the new cross-sectional weight. However, with this weight we raked to the 2011 (instead of the 2010) ACS totals. For more information on raking see our explanation below in our description of the loweight_w2rake weight.
The totals used for the raking process only included the population 18 years old and older and were as follows:
Race Totals
White         157,901,961
Black          27,947,198
Hispanic       34,532,913
Asian          11,664,324
Other (Multiracial, American Indian, Other, etc.)2    5,634,822

Gender Totals
Male          115,448,178
Female        122,233,040

Age Totals
18-29          52,298,454
30-49          83,421,502
50+           101,961,262
There were three cases where the person who answered in wave 1 does not seem to be the same person that answered in wave 2. While we exclude these three cases from the longitudinal weight and analysis, we include 2 of them as new respondents in wave 2. We gave them new IDs and flagged them with the variable flag_w2new so researchers could decide whether to include them in their analysis. In the weighting process, their age and gender reported in wave 2 were used. Since they were not re-asked their race, we used the race of the wave 1 respondent, who, most likely being a family member, had the same race as the new respondent. For the third case, it seems as though a new respondent answered for themselves and their father, who had been the respondent in wave 1. Thus this case was dropped from the raking process and not given a cross-sectional weight. This respondent is flagged via the wave2_use variable, and we suggest it not be used in any analysis. For more information on how we made these decisions see the do-file "Cleaning Wave 1 and 2 Respondents Based on Age."
2 This "other" category was calculated by subtracting the White, Black, Hispanic, and Asian totals from the total population over 18 years of age.
For respondents who seem to be the same person but gave a different birth year in wave 2 than they had in wave 1, or refused to answer in wave 2, we used their wave 1 birth year to create the weight. Since all wave 1 interviews were done in person, this seems to be the more reliable data point. There were seven respondents that did not give their age in wave 1 but gave their age in wave 2. Again, these were checked to see if they seemed to be the same person by their responses on key variables. After concluding that they were the same respondents, six of their ages from wave 2 were incorporated into wave 1 and used for both the cross-sectional and longitudinal weights. One of the birth years would have meant the respondent was 12 in wave 1. Since no one under 18 was included and the interview was in person, we coded this as a typo and imputed a birth year using the age from a randomly selected case in the dataset (i.e., unconditional hotdeck imputation). For more detailed documentation of this process see the do-file "Cleaning Wave 1 and 2 Respondents Based on Age."
nweight_w2rake—Raked Post-stratified weight (sample scaled)
This weight scales psweight_w2rake to the sample size. In other words, psweight_w2rake was multiplied by a scaling factor equal to the sample size divided by the total population.

nweight_w2rake = (Sample Size / Total Population) × psweight_w2rake
Note that psweight_w2rake and nweight_w2rake have considerably less variability (less
than half the variance) than their un-raked counterparts psweight2 and nweight2 provided
by Abt SRBI survey firm. Therefore, it is our belief that the raked weights may be more
appropriate for analyses in regression models.
pre_retent—Retention Weight
Noting the importance of the geographic weight in the wave 1 survey for accounting for
the survey design, we wanted to start with the original person weight—pawt. At the same
time, we wanted to take into consideration that some individuals were more likely to
continue with the survey than others. While we could re-weight those who stayed in the
survey by their characteristics like race, gender and age by using the population data, we
wanted to also control for the fact the variables we had in wave 1 that are central to the
survey questions, like faith, might also be correlated with attrition. Thus, we created an
attrition model that predicted the likelihood of respondents remaining in the sample in
wave 2.
To create this attrition model we first flagged all the variables that were asked of all respondents in wave 1. In other words, we excluded all the variables that were part of skip patterns or asked only of certain respondents.3 We then evaluated the correlation between each variable and attrition in the survey. Using the pseudo R2 from a logistic regression model, we selected the 30 variables that were most correlated with attrition. We then examined these variables to decide which ones had a high R2 and would give us a theoretically robust model. We also considered which variables had the fewest missing cases, to minimize the need for imputing missing data. The final model included 14 variables.4 Using these variables we then predicted the likelihood that each person would complete wave 2.5 We then divided 1 by this predicted likelihood to get pre_retent, the retention weight. The effect of dividing 1 by the probability of response is that people who had a high probability of participating in both waves receive a lower weight, while people with a low probability of participating in both waves receive a higher weight. Intuitively, wave 1 respondents with a low probability of participating in wave 2 are underrepresented in the wave 2 dataset, because fewer people with similar wave 1 responses participated in wave 2. Giving cases with a high probability of non-response a higher weight corrects for that imbalance. For example, if a respondent had a probability of responding in wave 2 of .5, that person would receive a weight of 2 (1/.5 = 2). Someone with an even lower probability of response, say .25, would receive a weight of 4. If the model for probability of response at wave 2 does a good job of predicting non-response, the weighted sample should more closely resemble the wave 2 sample that would have been collected if all respondents had been reinterviewed, compared to the unweighted data.

3 Variables regarding household rosters, introductory consent variables, etc. were also dropped. The final list of variables tested for their correlation with attrition included: r_1, hc_3, sa_1, sa_4, sa_7, sa_10, sa_13, sa_19, sa_22, sa_25, sa_28, sa_31, ci_0, ne_1, hm_23_fe, hm_23_in, dm_1_yea, lv_1, lv_2, re_13, re_15, re_16, re_17, hc_1, hc_5, hc_6, hc_19, hc_21, se_1, se_2, se_3, se_4, se_5, se_6, se_7, ci_1, ca_1, rc_1, rc_2, rc_4, ic_1, ic_2, ic_3, ic_4, ic_6, ic_7, ic_8, ic_9, ic_10, ic_15, po_1, po_4, po_6, po_10, vo_1, ma_1, ma_2, ma_3, ma_4, ma_5, ma_6, ma_7, ma_8, ma_9, ma_10, ma_11, rm_1, rm_2, ra_1, ra_2, ra_3, ra_5, ra_5a, ra_6, ra_7, ra_8, ra_9, ra_11, ra_12, ra_15, ra_18, ra_19, ac_5, ac_6, adv_1, adv_2, adv_3, are_5, are_6, ama_1, ama_2, ama_3, ama_4, ama_5, ama_6, ama_7, ama_8, ama_10, ama_11, arbl_1, arbl_3, arbl_9, arbl_11, arbl_12, arbl_13, art_5, art_7, sp_1, sp_2, hm_1, hm_2, hm_3, hm_4, hm_5, hm_6, hm_7, hm_8, hm_9, hm_10, hm_11, hm_12, hm_15, hm_16, hm_17, hm_20, hm_22, hm_24, hm_25, hm_26, hm_27, hm_28, hm_29, hm_30, hm_32, hm_33, hm_35, hm_36, hm_37, hm_38, dm_1a, dm_1b, dm_1k, dm_1l, dm_2, dm_5, cl_2, cl_4, cl_9, cl_14, cl_20, iq_1, iq_4, iq_5, iq_6, iq_7, resp_rse, re_1a

4 The variables included in the model were as follows: age (dm_1_yea), gender (resp_rse), race (re_1a), education (dm_2), email frequency (sa_3), frequency of discussion with a person that has a professional degree (ic_3), voted in the last election (po_4), beliefs regarding abortion (ma_9), language the interview was conducted in (iq_4), frequency of light exercise (hm_15), whether the home was rented or owned (hc_5), the interviewer's rating of cooperativeness (iq_1), belief in God (ra_1), and whether an additional phone number was provided for future contact (cl_9). Since there were only 6 individuals whose highest degree of completion was a religious masters (and, interestingly, all of these individuals did not complete wave 2), we combined this category with non-religious masters. All other variables were left as is for the model.

5 There were 95 cases that were missing at least one value for one of these 14 variables. To ensure these cases were not dropped (since we needed them to remain in the data to be assigned a weight), we had the computer randomly assign values for the missing variables. Each of the possible values was equally likely to be chosen. We ran 30 different imputations and took the mean value. The seven cases that did not have age resulted in the largest variations because the possible age values spanned the largest range of any of the variables.
pre_retent = 1 / (probability of responding in wave 2)
The retention weight (pre_retent) was then multiplied by the population weight from wave 1 (pawt). This way, the original geographic elements of the weights are maintained in the new longitudinal weight.
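The two steps above—inverting each respondent's predicted response probability and multiplying by the wave 1 weight pawt—can be sketched as follows. This is a hypothetical Python illustration; the probabilities and weight values are invented, and only the formulas come from the text:

```python
# Hypothetical sketch of the retention-weight construction described
# above: divide 1 by the predicted wave 2 response probability, then
# multiply by the wave 1 population weight (pawt). All values made up.
response_prob = [0.5, 0.25, 0.8]        # predicted P(complete wave 2)
pawt = [90_000.0, 110_000.0, 70_000.0]  # toy wave 1 population weights

pre_retent = [1 / p for p in response_prob]  # e.g., 1/.5 = 2, 1/.25 = 4
long_base = [w * r for w, r in zip(pawt, pre_retent)]  # base for raking
```

Respondents unlikely to be reinterviewed get weights scaled up, compensating for similar respondents lost to attrition.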
loweight_w2rake—Post-stratified Longitude weight (sample scaled)
Since wave 2 did not retain all respondents from wave 1, and we adjusted the weight to
account for non-response at wave 2, the weights could not be plausibly assumed to
produce estimates that are representative of the desired population. Hence, we once again
post-stratified and adjusted the weights based on key demographic variables (age,
race/ethnicity, and gender). Therefore, we took the original weight multiplied by the
retention weight of just the people who continued from wave 1 to wave 2 and poststratified their weights to be representative of the race, gender, and age proportions of the
2006 United States’ population. To avoid having to collapse the smaller categories we
elected to use popular technique called “raking.”6
Raking adjusts the weights to match the population proportions through an iterative process (i.e., it performs the same set of steps multiple times). Unlike the cell adjustments used for previous weights, which divide the sample (and population) into all possible combinations of variables, raking uses the unconditional frequency of each variable. For example, raking might begin by adjusting the weights so that they match the race category population totals, then make the same adjustment for gender, then do the same for age, and then return to race, readjusting the weights to match the race proportions. This process continues until the weights converge—meaning the weights no longer change from iteration to iteration.

6 Battaglia, Michael P., David Izrael, David C. Hoaglin, and Martin R. Frankel. "Tips and Tricks for Raking Survey Data (a.k.a. Sample Balancing)." American Association for Public Opinion Research. http://www.amstat.org/sections/srms/Proceedings/y2004/files/Jsm2004-000074.pdf
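The iterative process described above can be sketched as follows. This is a minimal, hypothetical Python implementation of raking (iterative proportional fitting)—not the ipfweight program itself—and all data are invented:

```python
# Hypothetical sketch of raking: alternately rescale the weights so each
# margin (e.g., race, gender, age) matches its population total, looping
# until the weights stop changing. All data below are made up.
def rake(weights, margins, targets, max_iter=1000, tol=1e-10):
    """weights: starting weights; margins: one category label per
    respondent per margin; targets: dicts of category -> population total."""
    w = list(weights)
    for _ in range(max_iter):
        max_change = 0.0
        for cats, target in zip(margins, targets):
            # current weighted total in each category of this margin
            cur = {c: 0.0 for c in target}
            for wi, c in zip(w, cats):
                cur[c] += wi
            # rescale every weight so this margin matches its targets
            for i, c in enumerate(cats):
                new = w[i] * target[c] / cur[c]
                max_change = max(max_change, abs(new - w[i]))
                w[i] = new
        if max_change < tol:  # converged: weights no longer change
            break
    return w

# toy example: 4 respondents, two margins (gender and age group)
gender = ["m", "f", "m", "f"]
age = ["young", "young", "old", "old"]
w = rake([1.0, 1.0, 1.0, 1.0],
         [gender, age],
         [{"m": 40, "f": 60}, {"young": 55, "old": 45}])
```

After convergence, the weighted gender totals match 40/60 and the weighted age totals match 55/45 without ever forming the full gender-by-age cells.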
We elected to use the Stata program ipfweight created by Michael Bergmann to perform the raking on our weight variable. We did 1,000 iterations and checked to see whether the estimated proportions of race, gender, and age converged to the population proportions.7
The totals used for the raking process only included the population 18 years old and older and were as follows:
Race Totals
White         155,927,239
Black          25,722,778
Hispanic       29,296,686
Asian          10,084,467
Other (Multiracial, American Indian, Other, etc.)8    4,602,172

Gender Totals
Male          109,685,985
Female        115,947,357

Age Totals
18-29          50,033,230
30-49          86,303,561
50+            89,296,551
ploweight_w2rake—Post-stratified Longitudinal weight (population scaled)
To have a weight that totals to the population, we multiplied the loweight_w2rake weight by the quotient of the total population and the sample size. So for our case this means we multiplied the loweight_w2rake weight by 225,633,342/1,317, or 171,323.72.

ploweight_w2rake = (Total Population / Sample Size) × loweight_w2rake
ploweight_w2rake = (225,633,342 / 1,317) × loweight_w2rake
7 We also used the Stata program ipfraking created by Stas Kolenikov. Yet that program would end after 6 iterations despite being considerably off from the control populations. This concerned us, since we wanted to ensure our weights were as reliable as possible. Also, the weights seemed to have several extremes, which made us cautious. Thus, we also conducted a post-stratification cell adjustment where we used the same categories as in our raking. This produced very comparable and slightly more varied weights. To see full documentation on the cell adjustments, or to recreate these weights for use, see the do-file "Raking Longitudinal and Cross Sectional Weights," section 5, "Post-Stratification Cell Adjustment." After running several tests with different configurations, we concluded that raking with the ipfweight program was the most reliable.
8 This "other" category was calculated by subtracting the White, Black, Hispanic, and Asian totals from the total population over 18 years of age.
Additional Documentation
attrition.do
This file was used to examine the variables for their relationship with a person's likelihood to attrite. It also has notes at the bottom of the file on the 30 variables most predictive of attrition and their corresponding pseudo R2s.
Cleaning Wave 1 and 2 Respondents Based on Age
This do file tracks the changes made to both wave 1 and wave 2 data regarding cases that do not actually seem to be the same person, as well as changes to a person's birth year if it was provided in wave 2 but missing from wave 1. It has all the code used, as well as several comments about the reasoning behind the decisions made.
Creating Data Set for Attrition Model
This do file creates the variable attrition_w2 to mark those in the sample that responded in both wave 1 and wave 2. It also examines all questions asked in wave 1 and marks those asked of everyone, as well as whether these variables are continuous or categorical. This was done to set up a data set for the attrition model.
imputing
This do file imputes the necessary missing data and calculates the agreed-upon attrition model.
Raking Longitudinal and Cross Sectional Weights
This do file consists of all the programs and notes surrounding the creation of the actual weights, including the raking process. It also has the code for the post-stratification cell adjustment for the longitudinal weights for interested users.
Where to Find the Weights
Public Wave 1
pawt; PAWT2
Public Wave 2
psweight_w2rake; nweight_w2rake
Public Merged Wave 1 and Wave 2
ploweight_w2rake; loweight_w2rake
Restricted Wave 1
pre_retent
Restricted Wave 2
persweight1; hhweight1; baseweight2; psweight2; nweight2