Eliminating Aggregation Bias when Estimating Treatment Effects on Combined Outcomes with Applications to Quality of Life Assessment

Ian M. McCarthy∗
Emory University Department of Economics†
November 2014

Abstract

Researchers are often interested in combined measures such as overall ratings, indices of physical or mental health, or health-related quality-of-life (HRQoL) outcomes. Such measures are typically composed of two or more underlying discrete variables. I show that estimating the effect of a treatment on the combined measure is biased with non-random treatment selection. I provide a solution to this problem by adopting an alternative estimator that first estimates treatment effects on the underlying variables and then combines these effects into an overall effect on the combined outcome of interest.

JEL Classification: C18, C21, I10
Keywords: program evaluation, treatment effects, quality of life, cost effectiveness, comparative effectiveness

Funding: This project was supported by grant number K99HS022431 from the Agency for Healthcare Research and Quality. The content is solely the responsibility of the author and does not necessarily represent the official views of the Agency for Healthcare Research and Quality.

∗ I would like to thank Dann Millimet, Rusty Tchernis, John Mullahy, and Jon Skinner for comments on early drafts, as well as the participants of the 2014 American Society of Health Economists Conference.
† Emory University, Rich Memorial Building, Room 306, Atlanta, GA 30322, Email: ian.mccarthy@Emory.edu

1 Introduction

In many areas of applied research, an average or some other weighting procedure is used to combine underlying outcomes into a single summary measure. For example, the County Health Rankings from the University of Wisconsin Population Health Institute are based on a single summary measure calculated as a weighted combination of five health-related outcome variables (Peppard et al., 2003, 2008; Courtemanche et al., 2013).
The practice of combining several underlying outcome variables into a single summary measure is particularly prevalent in the analysis of individual health outcomes or health-related quality-of-life (HRQoL) data. For example, studies of physical limitations from the Health and Retirement Study (HRS) and similar datasets such as the Medical Expenditure Panel Survey often aggregate several individual discrete responses to generate some total number of limitations or some index of physical functioning, where the index or aggregated variable serves as the primary outcome of interest (Loprest et al., 1995; Dor et al., 2006; Dave et al., 2008; Haas, 2008). Researchers adopt a similar approach when analyzing common HRQoL assessments such as the EuroQol 5-dimension (EQ-5D) health outcome survey and the Short Form 6-dimension (SF-6D) health outcome survey, where the empirical analysis is often based solely on the summary score derived from aggregating the HRQoL profile into a single measure (Powell, 1984; Austin, 2002; Drummond et al., 2005; Manca et al., 2005; Brazier & Ratcliffe, 2007; Basu & Manca, 2012).¹ These and other aggregate index scores play an increasingly important role in the evaluation of health care programs and policy, as demonstrated by numerous initiatives and legislative requirements to collect HRQoL data both internationally and within the U.S. (Department of Health, 2008; Devlin et al., 2010; Porter, 2010; Ahmed et al., 2012; PCORI, 2012; Selby et al., 2012; Landro, 2012).
Examples of cost and comparative effectiveness studies based on these types of measures abound in both the health economics and health services research literature.² Some widely used datasets, such as the RAND HRS data, automatically provide researchers with aggregated indices including a mobility and activities of daily living index, formed based on the responses of several underlying binary outcomes.³

The implicit assumption is that estimating treatment effects on the aggregate outcome appropriately captures the combined treatment effects across the individual measures. However, the current study illustrates that this is not necessarily the case, and when confronted with non-random treatment assignment, an analysis based solely on index scores may yield biased treatment effect estimates. Importantly, this problem is separate from the common distributional issues encountered when analyzing HRQoL data (Austin, 2002; Manca et al., 2005; Basu & Manca, 2012; Hernández Alava et al., 2012) or from questions of which weights to adopt in the aggregation function. Rather, the problem arises more fundamentally from the aggregation process itself. I propose an alternative empirical approach that first estimates treatment effects on each underlying outcome and then re-interprets these effects in terms of the overall summary score (McCarthy, 2014).

¹ The EQ-5D is a five-dimensional HRQoL profile providing a score of 1 to 3 (or 1 to 5 in other versions of the questionnaire) in each of five health domains. The SF-6D is a similar metric composed of six health domains (Brazier et al., 2002; Brazier & Ratcliffe, 2007).
² The development of cost and comparative effectiveness research in economics is reflected in Garber & Phelps (1997), Brauer et al. (2006), and Chandra et al. (2011), among many others.
³ RAND HRS data available for download at rand.org/labor/aging/dataprod/hrs-data.html.
This two-stage estimator (2SE) therefore effectively reverses the order of the analysis, from aggregating to a summary score and then estimating treatment effects, to estimating treatment effects and then aggregating to the summary score. Through a series of Monte Carlo simulations, treatment effects estimates based solely on the summary score are shown to be biased under non-random treatment assignment, while the 2SE provides unbiased estimates of treatment effects. In the absence of treatment selection (e.g., randomly assigned treatment), the 2SE is also shown to provide treatment effects estimates equivalent to those of existing estimators used in the literature, with minor efficiency gains. I then provide two applications of the 2SE to the estimation of average treatment effects (ATE) with health outcomes data. The first application concerns the effect of retirement on physical health in the U.S., and the second application concerns the effect of complex spine surgery on patient HRQoL. The results reveal potentially large differences in the estimated ATE when relying solely on the index versus the 2SE, with ATE estimates differing by as much as 100% in some cases.

The current paper contributes to a growing line of research on the differences in health outcomes evaluation based on aggregated scores versus individual or joint evaluation of each health outcome of interest (Mortimer & Segal, 2008; Devlin et al., 2010; Parkin et al., 2010; Gutacker et al., 2013). However, the main concern in many of these studies is the choice of aggregation algorithm. Instead, I focus on a more fundamental problem introduced by the aggregation process itself, regardless of the weights adopted.
The results show that the distinction between analyses based on the underlying outcomes versus the combined outcome is not merely a normative issue surrounding the aggregation technique (e.g., how and for whom to measure preferences across health domains). Rather, relying solely on the combined outcome may yield biased treatment effects estimates when using observational data.

I discuss the broad empirical framework and 2SE in Section 2. Details of the Monte Carlo exercise are presented in Section 3, with applications presented in Section 4. Section 5 concludes.

2 Methods

The primary goal of the current analysis is to estimate the average treatment effect (ATE) on some combined outcome (or index score) when treatment assignment is non-random. Although several other treatment effects estimates may ultimately be of interest, I consider the average treatment effect in order to focus the analysis on the specific impact of selection on observables in estimating treatment effects on combined outcomes. The empirical issues would naturally extend to other treatment effects such as the average treatment effect on the treated (ATT) or the average treatment effect on the untreated (ATU), with potentially larger bias for more recent estimators such as the person-specific treatment effects developed in Basu (2013).

2.1 Two-stage Estimator (2SE)

Rather than rely on an aggregate outcome, I propose a two-stage estimator (2SE) which first estimates a model of each individual outcome and then re-interprets the coefficients in terms of the effects on the overall index. For example, a common outcome of interest in the HRS data is the mobility index ranging from 0 to 5 based on the individual’s self-reported difficulty of walking one block, walking several blocks, walking across a room, climbing one flight of stairs, and climbing several flights of stairs (Dave et al., 2008). The mobility index is therefore constructed from five individual, binary outcomes.
Denote a vector of $D$ individual outcomes by $y = (y_1, \ldots, y_D)$, and denote some aggregation function of these outcomes by $f(y)$. In practice, $f(y)$ is often some form of weighted average or, in the case of the mobility index discussed above, simply the sum of individual responses. The 2SE first estimates separate models for each individual dichotomous outcome. I then form predicted probabilities, $\hat{P}_{id}$, for each person $i$ and each outcome $d$. The overall ATE can then be estimated using the method of recycled predictions, where I estimate predicted index scores under alternative covariate values in order to estimate effects of interest (Oaxaca, 1973; Graubard & Korn, 1999; Basu & Rathouz, 2005; Basu, 2005; Glick, 2007; Kleinman & Norton, 2009). Specifically, denoting treatment status by the indicator $T_i$, I estimate the ATE by:

$$ATE = \frac{1}{N} \sum_{i=1}^{N} \left\{ f(\hat{y}_i \mid T_i = 1) - f(\hat{y}_i \mid T_i = 0) \right\}, \tag{1}$$

where $f(\hat{y}_i \mid T_i = 1)$ and $f(\hat{y}_i \mid T_i = 0)$ denote the aggregation function applied to predicted values of $y_1, \ldots, y_D$ with and without treatment, respectively.

Although equation (1) generally reflects the estimated ATE from the 2SE, the specific methods underlying the 2SE will vary by application. In the current paper, I consider two empirical applications: 1) the effect of retirement on physical health; and 2) the effect of surgical treatment on health for spine surgery patients. The second application is more in line with a standard cost or comparative effectiveness study, while the first application revisits a common health economics question in light of the potential biases introduced through aggregation and highlighted in the Monte Carlo simulations in Section 3. In the remainder of this section, I discuss the specific form of the 2SE for each of these applications as well as the traditional empirical methods adopted for comparison purposes.
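The second stage of the 2SE for a sum-type index can be sketched in a few lines. The sketch below assumes that the first stage has already produced logit coefficients for each binary outcome; the coefficient tuples, the single covariate, and the function names are hypothetical and chosen only for illustration. It treats the predicted probabilities as the predicted outcomes and averages the treated-minus-untreated difference in the predicted index across the sample.

```python
import math

def logistic(z):
    """Logistic CDF, used here as the first-stage link for each binary outcome."""
    return 1.0 / (1.0 + math.exp(-z))

def predicted_index(x, t, coefs):
    """Predicted index f(y_hat | T = t) under the sum aggregator: the sum of
    predicted probabilities across outcomes, each from its own model."""
    return sum(logistic(a + b * x + d * t) for a, b, d in coefs)

def ate_recycled(xs, coefs):
    """Recycled predictions: for every person, predict the index with T = 1
    and with T = 0 (covariates held fixed), then average the difference."""
    diffs = [predicted_index(x, 1, coefs) - predicted_index(x, 0, coefs)
             for x in xs]
    return sum(diffs) / len(diffs)

# Hypothetical first-stage estimates: (intercept, slope on x, treatment effect)
coefs = [(0.5, 0.5, 1.0)] * 5
xs = [-1.0, 0.0, 1.0]
print(round(ate_recycled(xs, coefs), 4))
```

Because each outcome enters through its own nonlinear model, the implied effect on the index varies with x, even when the latent treatment effect is the same in every domain.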
2.2 Retirement and Health

Measuring health based on the mobility index discussed previously, the aggregation function is simply the sum of the individual binary outcomes, $f(y) = \sum_{d=1}^{5} y_d$. Denoting the index value by $\tilde{y}$, and assuming panel data consistent with the HRS data, a linear fixed or random effects model is a common approach to estimating the effect of retirement on mobility (Mein et al., 2003; Van Solinge, 2007; Dave et al., 2008). In this case, the regression equation is as follows:

$$\tilde{y}_{it} = \nu_i + x_{it}\beta + \varepsilon_{it}, \tag{2}$$

where $\nu_i$ denotes individual-level random (fixed) effects and, for simplicity, $\varepsilon_{it}$ is an idiosyncratic error term with mean 0 and variance $\sigma^2$.

In this setting, the 2SE estimates $D = 5$ separate regressions, one for each individual binary outcome. Assuming the random effects, $\nu_i$, follow a normal distribution with mean 0 and variance $\sigma_\nu^2$, the contribution to the likelihood from person $i$ is:

$$L_i = \int_{-\infty}^{\infty} \frac{e^{-\nu_i^2 / 2\sigma_\nu^2}}{\sigma_\nu \sqrt{2\pi}} \left\{ \prod_{t=1}^{T_i} F(y_{idt}, \nu_i + x_{it}\beta) \right\} d\nu_i,$$

where $F(y, z) = 1/(1 + \exp(-z))$ if $y = 1$ and $F(y, z) = 1/(1 + \exp(z))$ if $y = 0$. Maximum likelihood estimation proceeds with adaptive Gauss-Hermite quadrature.

Although the HRS data are longitudinal by nature, the estimation of fixed effects models using conditional maximum likelihood introduces practical issues that may cloud the comparison between the 2SE and the standard linear estimators. My analysis using the panel structure of the data is therefore limited to random effects models. However, even in the case of random effects, the techniques required for maximum likelihood estimation of the random effects logistic model rely on approximations that may still call into question the comparability of the 2SE and linear models. I therefore consider an additional pre-post analysis where I analyze physical health in period $t$ as a function of physical health in period $t - 1$, retirement status in period $t$, and other control variables.
For this analysis, I estimate separate cross-sectional models for each HRS wave, where the first stage of the 2SE estimates a logistic regression model for each binary outcome, while the linear model is estimated using OLS.

2.3 Comparative Effectiveness

The second application I consider is the effect of surgical treatment on HRQoL for patients undergoing spine surgery. This application relates directly to the growing cost and comparative effectiveness literature. In this application, HRQoL is measured by the SF-6D, which is described in more detail in Appendix A.

Details of the 2SE applied to the SF-6D are presented in McCarthy (2014). Generally, I estimate separate ordered probit models for each HRQoL domain, and use the results of each model to form predicted probabilities of responses. I denote the predicted probabilities by $\hat{P}_{ijd}$, for person $i$, response $j$, and domain $d$. For example, in the physical functioning domain of the SF-6D, the first-stage results provide six predicted probabilities for each person, $\hat{P}^{PF}_{i1}, \hat{P}^{PF}_{i2}, \ldots, \hat{P}^{PF}_{i6}$. Continuing this process for all domains and adopting the scoring algorithm in Table 1, the probability estimates from the ordered probit estimation can then be converted to an SF-6D summary score as follows:⁴

$$
\begin{aligned}
\hat{S}_i = 1 &- 0.035 \times \left( \hat{P}^{PF}_{i2} + \hat{P}^{PF}_{i3} \right) - 0.044 \times \hat{P}^{PF}_{i4} - 0.056 \times \hat{P}^{PF}_{i5} - 0.117 \times \hat{P}^{PF}_{i6} \\
&- 0.053 \times \left( \hat{P}^{RL}_{i2} + \hat{P}^{RL}_{i3} + \hat{P}^{RL}_{i4} \right) \\
&- 0.057 \times \hat{P}^{SF}_{i2} - 0.059 \times \hat{P}^{SF}_{i3} - 0.072 \times \hat{P}^{SF}_{i4} - 0.087 \times \hat{P}^{SF}_{i5} \\
&- 0.042 \times \left( \hat{P}^{Pain}_{i2} + \hat{P}^{Pain}_{i3} \right) - 0.065 \times \hat{P}^{Pain}_{i4} - 0.102 \times \hat{P}^{Pain}_{i5} - 0.171 \times \hat{P}^{Pain}_{i6} \\
&- 0.042 \times \left( \hat{P}^{MH}_{i2} + \hat{P}^{MH}_{i3} \right) - 0.100 \times \hat{P}^{MH}_{i4} - 0.118 \times \hat{P}^{MH}_{i5} \\
&- 0.071 \times \left( \hat{P}^{V}_{i2} + \hat{P}^{V}_{i3} + \hat{P}^{V}_{i4} \right) - 0.092 \times \hat{P}^{V}_{i5} \\
&- 0.061 \times \hat{P}(\text{Most Severe}).
\end{aligned}
\tag{3}
$$

Equation (3) allows the researcher to use all of the available HRQoL information but still offers the familiar interpretation of effects in terms of the composite score.
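As an illustration of how equation (3) maps predicted probabilities into a summary score, the sketch below codes the decrements by domain and response level. The dictionary layout and function name are assumptions made for this example, and the probability of a "most severe" outcome is taken as an input because its construction is described in McCarthy (2014) rather than here.

```python
# Decrements from equation (3); position k holds the decrement applied to the
# predicted probability of response level k + 1 (level 1 carries no decrement).
DECREMENTS = {
    "PF":   [0.0, 0.035, 0.035, 0.044, 0.056, 0.117],
    "RL":   [0.0, 0.053, 0.053, 0.053],
    "SF":   [0.0, 0.057, 0.059, 0.072, 0.087],
    "Pain": [0.0, 0.042, 0.042, 0.065, 0.102, 0.171],
    "MH":   [0.0, 0.042, 0.042, 0.100, 0.118],
    "V":    [0.0, 0.071, 0.071, 0.071, 0.092],
}
MOST_SEVERE = 0.061

def sf6d_score(probs, p_most_severe):
    """Predicted SF-6D summary score for one person.

    probs maps each domain to its list of predicted response probabilities
    (hypothetical first-stage output); p_most_severe is supplied by the caller.
    """
    score = 1.0 - MOST_SEVERE * p_most_severe
    for domain, weights in DECREMENTS.items():
        p = probs[domain]
        if len(p) != len(weights):
            raise ValueError(f"expected {len(weights)} probabilities for {domain}")
        score -= sum(w * pk for w, pk in zip(weights, p))
    return score

# Example: probability mass split evenly across levels in every domain
even = {d: [1.0 / len(w)] * len(w) for d, w in DECREMENTS.items()}
print(round(sf6d_score(even, 0.0), 3))
```

Evaluating the function on first-stage probabilities predicted with and without treatment, and averaging the difference across patients, gives the recycled-predictions ATE on the summary score.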
Applied to HRQoL data, the 2SE also avoids some of the statistical difficulties in the analysis of HRQoL data that are introduced by the underlying scoring process (e.g., censoring at the boundaries of the summary score). Existing methods for the analysis of HRQoL data include standard OLS, variations of the classic Tobit model, censored least-absolute deviations, Beta MLE, and Beta QMLE (Powell, 1984; Austin, 2002; Basu & Manca, 2012). For comparison with the 2SE, I therefore consider OLS as well as the Beta MLE and QMLE models proposed in Basu & Manca (2012), where I again estimate the ATE using the method of recycled predictions. For the Beta MLE and QMLE models, note that the conditional mean function for estimating the ATE is (Basu & Manca, 2012):

$$\hat{\mu}_i(y_i \mid x_i) = \frac{\exp(x_i\hat{\beta})}{1 + \exp(x_i\hat{\beta})}. \tag{4}$$

⁴ See McCarthy (2014) for details regarding the estimated probability of a “most severe” outcome in the SF-6D.

3 Monte Carlo Simulations

3.1 General Case of Binary Individual Outcomes

I first highlight the problem with a simplified data generating process (DGP) in which a single index is derived from a sum of five individual binary outcomes. Each individual outcome is generated from an underlying latent continuous variable,

$$y^*_{id,t=0} = \alpha_d + x_i\beta_d + \varepsilon_{id,t=0}, \tag{5}$$

where $d$ denotes the domain or the individual outcome measure, $i$ denotes a person, $t = 0$ denotes the baseline or pre-treatment period, and $\varepsilon_{id}$ is assumed to follow a normal distribution with $\mu = 0$ and $\sigma = 1$, independent across $d$.⁵ The median of the empirical distribution functions of $y^*_{d,t=0}$, denoted $\bar{y}^*_d$, is taken as the threshold value for the observed binary outcome. For simplicity, $x$ consists only of a single, normally distributed covariate with $\mu = 0$ and $\sigma = 1$.
Similarly, post-treatment ($t = 1$) latent outcomes are generated as follows:

$$y^*_{id,t=1} = \alpha_d + x_i\beta_d + \delta_d T_i + \varepsilon_{id,t=1},$$

where $\delta_d$ denotes the treatment effect in domain $d$, $T_i$ denotes treatment status, and the remaining variables are defined as in equation (5). The observed outcome, $y_{id,t=1}$, is then calculated based on the baseline threshold values, $\bar{y}^*_d$, and the index score for baseline and post-treatment is calculated as the sum of all 5 binary outcomes.

Within this structure, I simulate 50 different datasets with alternative degrees of selection (on observables). First, I generate a random variable for each person from a uniform distribution with support from 0 to 1, $r_i \sim U[0, 1]$ for all $i$. Treatment status is then determined by

$$T_i = \mathbf{1}\left[ r_i > \Phi\left( x_i \times \frac{\rho}{50} \right) \right]$$

for $\rho = 0, \ldots, 50$. Each value of $\rho$ therefore represents a different DGP, with $\rho = 0$ representing the case of no selection and $\rho = 50$ representing the highest extent of selection considered. For each value of $\rho$, I simulate 100 datasets with 5,000 observations in each simulation. Following common approaches in the applied literature, I estimate the ATE on the index scores using standard OLS and compare this to the effect from the 2SE.

⁵ The estimation could be extended to allow for nonzero correlation across domains; however, such an approach would only impact the efficiency of the estimated coefficients and would not impact the point estimates. Since the current focus is on the bias introduced through the aggregation process itself, I simplify the DGP and assume the error terms of the individual measures are independent.

Results from the simulations are summarized in Figure 1. The top row of Figure 1 is based on a DGP with identical functional forms for each latent outcome: $\alpha_d = 0.5$ and $\beta_d = 0.5$ for all $d$, and $\delta_d = 1$. The bottom row considers heterogeneous treatment effects across domains, with $\delta = [1, 1.5, 2, 0.5, 0]'$ and all other coefficients unchanged.
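To make the simulation design concrete, the following is a minimal sketch of one draw from this DGP using only the Python standard library. It assumes that an observed binary outcome equals 1 when the latent value exceeds the baseline median (the direction of the cutoff is an assumption) and that Φ is the standard normal CDF; the function name and return layout are hypothetical.

```python
import random
from statistics import NormalDist, median

def simulate(n, rho, delta, seed=0, alpha=0.5, beta=0.5):
    """One simulated dataset: covariate x, treatment T, and the baseline and
    post-treatment index scores (sums of len(delta) binary outcomes)."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    x = [rng.gauss(0, 1) for _ in range(n)]
    # Selection on observables: T_i = 1[r_i > Phi(x_i * rho / 50)]
    treat = [int(rng.random() > phi(xi * rho / 50)) for xi in x]
    index0 = [0] * n
    index1 = [0] * n
    for d_eff in delta:
        base = [alpha + beta * xi + rng.gauss(0, 1) for xi in x]
        post = [alpha + beta * xi + d_eff * ti + rng.gauss(0, 1)
                for xi, ti in zip(x, treat)]
        cut = median(base)  # baseline median used as threshold in both periods
        for i in range(n):
            index0[i] += int(base[i] > cut)
            index1[i] += int(post[i] > cut)
    return x, treat, index0, index1

# rho = 50 is the strongest selection considered; rho = 0 is random assignment
x, treat, index0, index1 = simulate(n=5000, rho=50, delta=[1.0] * 5, seed=1)
print(sum(treat) / len(treat))
```

With `rho` above zero, people with larger x are less likely to be treated under this rule, which is the selection on observables that the OLS-on-index and 2SE estimators are then asked to handle.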
Figure 1 clearly illustrates the increasing bias introduced through the selection process. In both DGPs, even though the common assumption of unconfoundedness holds, the influence of selection on the individual outcomes is not appropriately accounted for when focusing solely on the index measure. The 2SE, meanwhile, restores the unbiased estimation of the average treatment effect by estimating the effects on each individual outcome and then converting into an effect on the index.

Intuitively, the bias derives from an inherent nonlinearity that is not accounted for when relying solely on the combined outcome. An OLS model that is linear in $x$ is therefore misspecified. As a robustness check, I re-estimated the combined outcome models with additional nonlinear terms in $x$, including $x^2$, $x^3$, and indicator variables for $x$ based on quartile. The results were largely unchanged from Figure 1. Therefore, although nonlinearities in $x$ may be the source of the bias, it is unclear how to sufficiently approximate this nonlinearity in a single regression specification. The 2SE, meanwhile, avoids this problem altogether and explicitly estimates the probabilities of every individual outcome, each of which is nonlinear in $x$.

3.2 Specific Case of the SF-6D

I simulate SF-6D data beginning with a latent continuous variable for each HRQoL domain ($d = 1, \ldots, 6$), denoted $y^*_{id}$, specified as a function of a $1 \times 2$ vector of covariates, $x_i$, a constant, a treatment indicator $T_i$, and a normally distributed error term, $\varepsilon_{id}$. I further denote by $\alpha$ the $6 \times 1$ vector of intercept coefficients, by $\beta$ the $6 \times 2$ matrix of slope coefficients on $x$, and by $\delta$ the $6 \times 1$ vector of treatment effect coefficients.

As a baseline DGP, I generate the $6 \times 1$ vector of latent HRQoL values, $y^*_i$, as follows:

$$y^*_i = \alpha + \beta x_i' + \delta T_i + \varepsilon_i,$$

where $\varepsilon \sim N(0_{6 \times 1}, I_{6 \times 6})$, $x_1, x_2 \sim N(0, 1)$, $\alpha = 0.5 \times I_{6 \times 1}$, $\beta = I_{6 \times 1} \times [1.5, 1]$, and $\delta = 1.5 \times I_{6 \times 1}$.
I extend this baseline DGP by considering larger treatment effects, $\delta = 3 \times I_{6 \times 1}$, and variable treatment effects across domains, $\delta = [2, 1, 0.5, 2.5, 0, 1]'$. I also consider the role of interaction terms in the estimated treatment effects by specifying

$$y^*_i = \alpha + \beta x_i' + \delta T_i + \gamma x_i' \times T_i + \varepsilon_i,$$

where $\gamma = [1.5, 1, 2.5, 0.5, 2, 0.5]'$, again for three different parameterizations of $\delta$: $\delta = 1.5 \times I_{6 \times 1}$, $\delta = 3 \times I_{6 \times 1}$, and $\delta = [2, 1, 0.5, 2.5, 0, 1]'$. In all cases, I adopt two alternative possibilities for treatment status: 1) random treatment assignment, $T_i = 1(r_i < 0.5)$ with $r_i \sim U[0, 1]$ for all $i$; and 2) selection on observed variables, $T_i = 1(r_i < \Phi(1.5 - 3x_{1i}))$. In total, this yields 12 simulated DGPs, 6 with random treatment assignment and 6 with selection on observed variables.

Observed HRQoL values, $y_{id}$ for $d \in \{1, 2, 3, 4, 5, 6\}$, are then generated based on the value of the latent value, $y^*_{id}$, relative to the $J_d \times 1$ vector of threshold values in each domain:

$$
\begin{aligned}
\gamma_{PF} &= [-2, -0.6, 0.5, 1.8, 3.2]' \\
\gamma_{RL} &= [-1, 0.6, 2.2]' \\
\gamma_{SF} &= [-1.6, 0, 1.2, 2.6]' \\
\gamma_{P} &= [-1.8, -0.5, 0.5, 1.7, 3.1]' \\
\gamma_{MH} &= [-1.6, -0.1, 1.3, 2.7]' \\
\gamma_{V} &= [-1.7, -0.2, 1.2, 2.8]'.
\end{aligned}
$$

Threshold values were selected based on the respective quantile for each domain. For example, the threshold values in the physical functioning domain are such that approximately one-sixth of the observations fall below -2 in that domain, one-sixth fall between -2 and -0.6, etc. The resulting distributions of the summary scores are therefore well-behaved and approximately normally distributed between 0.3 and 1 (the minimum and maximum values based on the SF-6D scoring algorithm in Brazier & Ratcliffe (2007), respectively).

For each DGP, I simulate 1,000 datasets consisting of $N = 500$ observations. As discussed in Section 2, I estimate the ATE on the summary score with four alternative estimators: 1) 2SE; 2) standard OLS; 3) the Beta MLE model proposed in Basu & Manca (2012); and 4) the Beta QMLE also proposed in Basu & Manca (2012).
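The two mechanical steps in this design, discretizing a latent value into a response level and assigning treatment, can be sketched as follows. The mapping assumes that the response level is one plus the number of thresholds lying at or below the latent value; the text does not state the direction of the ordering, so this is an illustrative choice. The function names are hypothetical, and Φ is taken to be the standard normal CDF.

```python
from bisect import bisect_right
from statistics import NormalDist

# Domain thresholds as listed in the text
THRESHOLDS = {
    "PF":   [-2.0, -0.6, 0.5, 1.8, 3.2],
    "RL":   [-1.0, 0.6, 2.2],
    "SF":   [-1.6, 0.0, 1.2, 2.6],
    "Pain": [-1.8, -0.5, 0.5, 1.7, 3.1],
    "MH":   [-1.6, -0.1, 1.3, 2.7],
    "V":    [-1.7, -0.2, 1.2, 2.8],
}

def observed_level(y_star, thresholds):
    """Response level in 1..len(thresholds) + 1 for a latent value y_star
    (assumed ordering: larger latent values map to higher levels)."""
    return bisect_right(thresholds, y_star) + 1

def assign_treatment(x1, r, selection):
    """T_i = 1(r_i < Phi(1.5 - 3 * x1)) under selection on observables,
    and T_i = 1(r_i < 0.5) under random assignment."""
    if selection:
        return int(r < NormalDist().cdf(1.5 - 3.0 * x1))
    return int(r < 0.5)

print(observed_level(0.2, THRESHOLDS["PF"]))
print(assign_treatment(x1=1.0, r=0.1, selection=True))
```

Under the selection rule, the treatment probability falls steeply in the first covariate, so treated and untreated samples differ sharply in x1 even though treatment depends only on observed variables.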
The results are summarized in Table 2.

Table 2

The 2SE consistently provides unbiased estimates of the true ATE across a range of alternative DGPs. By comparison, ATEs estimated with all other estimators are downward biased. The 2SE also provides the lowest RMSE in all cases, although the differences in RMSE across estimators are minimal and statistically insignificant.

4 Empirical Applications

4.1 Retirement and Physical Health

In my first application, I examine the effect of retirement on physical health using data from seven longitudinal waves of the HRS. The HRS is a biennial survey conducted by the University of Michigan beginning in 1992. I include all four HRS cohorts in my analysis: the original cohort consisting of individuals born between 1931 and 1941, as well as additional cohorts consisting of individuals born before 1924, between 1924 and 1930, and between 1942 and 1947.

In order to focus the analysis specifically on the problem of aggregation, I consider as the measure of physical health the mobility index as described in Section 2. This index is constructed from the individual’s self-reported difficulty of walking one block, walking several blocks, walking across a room, climbing one flight of stairs, and climbing several flights of stairs. Since the underlying mobility measures are only available starting in wave 2 (1994), my analysis is limited to 1994 through 2012. In both the random effects and cross-sectional analysis, I consider mobility as a function of age, race, gender, education, total household income, mother’s age and education, father’s age and education, whether the individual has any form of health insurance, and the individual’s retirement status. Summary statistics are provided in Table 3, and I discuss the details of these variables and the dataset construction in Appendix B.

Table 3

Results from the random effects model are summarized in Table 4.
I focus on the bottom row of Table 4, which presents the overall ATE for the 2SE and the linear random effects model. Note that a positive effect implies a worsening of physical health (i.e., increase in mobility difficulties). The results reveal large relative differences in the estimated ATE with the 2SE versus the linear model, with the linear model estimating a 20% larger ATE relative to the 2SE.

Table 4

Results from the cross-sectional analysis by HRS wave are illustrated in Figure 2.⁶ The figure presents the 90% confidence interval for the estimated ATE of retirement on the mobility index for each wave, with results for the 2SE and OLS in the left and right panels, respectively. The magnitudes of the effects at each wave are smaller than those from the longitudinal analysis in Table 4, and the absolute differences in ATEs between the 2SE and OLS are consequently smaller; however, the relative differences between the 2SE and OLS persist, with the estimated ATE from OLS exceeding that of the 2SE by as much as 37% in 2006.

Figure 2

As has been noted in the literature, the direction of the relationship between retirement and health is not clear ex ante. In particular, it may be that individuals of lesser health are more likely to retire, as documented by Dwyer & Mitchell (1999), McGarry (2004), Jones et al. (2010), and others. To address this source of endogeneity, I follow Dave et al. (2008) and limit the retired sample only to those who had no reported health problems prior to retirement. The results from the random effects model are summarized in Table 5, while the results from the cross-sectional analysis by HRS wave are illustrated in Figure 3.

Table 5 and Figure 3

The results from the longitudinal analysis are of smaller magnitude than the full-sample results, but show larger relative differences between OLS and the 2SE.
The cross-sectional analysis, however, reveals particularly large differences between the 2SE and OLS. Specifically, the 2SE finds a relatively large improvement in mobility following retirement among individuals who were healthy prior to retirement (up to a 26% improvement in mobility relative to the overall sample average of 0.75), while estimated ATEs using OLS are consistently lower in magnitude, are of opposite sign in 1996 and 2010, and even fall outside of the 2SE confidence bands in 2002 and 2004.

⁶ The pre-post analysis considers each HRS wave separately, and for each HRS wave, I estimate separate logit regression models for each mobility measure as well as an overall linear model for the mobility index. The full results of these regressions are omitted for brevity, and I focus instead on the estimated effect of full retirement on mobility. At each HRS wave, the sample is limited to those ages 50 to 75 who are newly retired or otherwise not retired and in the work force.

4.2 HRQoL Data following Spine Surgery

As an additional application, I examine the effect of surgical intervention on HRQoL for adult spinal deformity (ASD) patients. The data for this application were collected from a multi-center, prospective database maintained by the International Spine Study Group (ISSG). One of the largest datasets of its kind, the data consist of 362 consecutively enrolled adult scoliosis and spinal deformity patients at a participating ISSG member (i.e., hospital or surgery center), with institutional review board (IRB) approval obtained at all centers. For purposes of this application, I include as covariates the patient’s age, gender, body mass index (BMI), SF-6D scores at the time of the first physician visit, and a dummy variable for whether the patient underwent surgical treatment. The outcome of interest is the patient’s HRQoL after one year. Summary statistics are provided in Table 6.

Table 6

Coefficient estimates and ATEs are provided in Table 7.
As with the prior analysis of retirement and health, the treatment effect for each individual outcome measure is translated to an ATE on the index score using the method of recycled predictions, and these ATE estimates are presented in the bottom of Table 7. I also compared the model fit across all estimators as measured by the RMSE.

Table 7

The estimated effects in this application are relatively small, with an estimated ATE of 0.033 based on OLS compared to the effect of 0.029 using the 2SE. Although these differences are not statistically significant, the relative difference is potentially meaningful, with the estimated ATEs from OLS and Beta QMLE nearly 14% larger than the effects estimated from the 2SE. The magnitude of these differences is perhaps more meaningful when put into the context of a cost effectiveness study. For example, if the incremental cost of surgery averages $100,000, a difference of 0.004 in the estimated effect of surgery on QALYs equates to a difference of nearly $30,000 per QALY in terms of cost effectiveness over a 20 year period (with QALYs discounted at 3% per year and all costs incurred at period 0).

5 Discussion

This study considered the potential bias introduced when estimating treatment effects using traditional regression methods based solely on combined outcomes, and illustrated this bias through a series of Monte Carlo simulations with a variety of alternative DGPs and treatment assignment mechanisms. In the presence of selection on observed variables, the source of the bias is not simply a matter of replacing linear methods with some other nonlinear model. Instead, the bias derives from a fundamental difference between the impact of selection on the underlying domains versus the impact of selection on the overall aggregated score. The results therefore indicate that an analysis based solely on summary scores cannot appropriately control for the impact of selection, even under the standard unconfoundedness assumptions.
The bias demonstrated in this paper would be compounded when studying health outcomes over an extended period. For example, in the cost effectiveness literature, researchers typically calculate the HRQoL index score and add up the individual scores over time (Drummond et al., 2005; Brazier & Ratcliffe, 2007; Gray et al., 2011). To the extent treatment effects persist over time, the magnitude of the bias under these conventional methods would only increase. The Monte Carlo results illustrated the improved performance available via an alternative two-stage estimator that first estimates the treatment effect on each underlying outcome and then re-interprets these effects in terms of the index score based on predicted values from the first-stage regressions. The 2SE was shown to restore the unbiased estimation of treatment effects while maintaining the parsimony of the summary score interpretation. In doing so, the proposed methodology can improve the economic evaluation of health care programs when randomized controlled trials are not available. As funding for clinical and health economics studies becomes more competitive, researchers (particularly in health economics) are increasingly dependent on secondary data analysis. The proposed methodology is therefore an important step in ensuring that economic analysis and inference performed outside of randomized trials remain accurate and informative, while still allowing for the familiar interpretation of results in terms of the composite score. 13 References Ahmed, Sara, Berzon, Richard A, Revicki, Dennis A, Lenderking, William R, Moinpour, Carol M, Basch, Ethan, Reeve, Bryce B, Wu, Albert W, et al. 2012. The use of patient-reported outcomes (PRO) within comparative effectiveness research: implications for clinical practice and health care policy. Medical Care, 50(12), 1060–1070. Austin, P.C. 2002. A comparison of methods for analyzing health-related quality-of-life measures. Value in Health, 5(4), 329–337. 
Basu, A., & Manca, A. 2012. Regression Estimators for Generic Health-Related Quality of Life and Quality-Adjusted Life Years. Medical Decision Making, 32(1), 56–69.
Basu, Anirban. 2005. Extended generalized linear models: simultaneous estimation of flexible link and variance functions. Stata Journal, 5(4), 501–516.
Basu, Anirban. 2013. Estimating Person-centered Treatment (PeT) Effects using Instrumental Variables: An Application to Evaluating Prostate Cancer Treatments. Journal of Applied Econometrics.
Basu, Anirban, & Rathouz, Paul J. 2005. Estimating marginal and incremental effects on health outcomes using flexible link and variance function models. Biostatistics, 6(1), 93–109.
Brauer, Carmen A., Rosen, Allison B., Greenberg, Dan, & Neumann, Peter J. 2006. Trends in the Measurement of Health Utilities in Published Cost-Utility Analyses. Value in Health, 9(4), 213–218.
Brazier, J., & Ratcliffe, J. 2007. Measuring and valuing health benefits for economic evaluation. Oxford University Press, USA.
Brazier, J., Roberts, J., & Deverill, M. 2002. The estimation of a preference-based measure of health from the SF-36. Journal of Health Economics, 21(2), 271–292.
Chandra, A., Jena, A.B., & Skinner, J.S. 2011. The Pragmatist’s Guide to Comparative Effectiveness Research. The Journal of Economic Perspectives, 25(2), 27–46.
Courtemanche, Charles, Soneji, Samir, & Tchernis, Rusty. 2013. Modeling Area-Level Health Rankings. Tech. rept. National Bureau of Economic Research.
Dave, Dhaval, Rashad, Inas, & Spasojevic, Jasmina. 2008. The Effects of Retirement on Physical and Mental Health Outcomes. Southern Economic Journal, 497–523.
Department of Health. 2008. Guidance on the Routine Collection of Patient Reported Outcome Measures (PROMs).
Devlin, N.J., Parkin, D., & Browne, J. 2010. Patient-reported outcome measures in the NHS: new methods for analysing and reporting EQ-5D data. Health Economics, 19(8), 886–905.
Dor, Avi, Sudano, Joseph, & Baker, David W. 2006. The effect of private insurance on the health of older, working age adults: evidence from the Health and Retirement Study. Health Services Research, 41(3p1), 759–787.
Drummond, M.F., Sculpher, M.J., & Torrance, G.W. 2005. Methods for the economic evaluation of health care programmes. Oxford University Press, USA.
Dwyer, Debra Sabatini, & Mitchell, Olivia S. 1999. Health problems as determinants of retirement: Are self-rated measures endogenous? Journal of Health Economics, 18(2), 173–193.
Garber, A.M., & Phelps, C.E. 1997. Economic foundations of cost-effectiveness analysis. Journal of Health Economics, 16(1), 1–31.
Glick, H. 2007. Economic evaluation in clinical trials. Oxford University Press, USA.
Graubard, Barry I, & Korn, Edward L. 1999. Predictive margins with survey data. Biometrics, 55(2), 652–659.
Gray, A.M., Clarke, P.M., Wolstenholme, J., & Wordsworth, S. 2011. Applied Methods of Cost-effectiveness Analysis in Healthcare. Oxford University Press.
Gutacker, Nils, Bojke, Chris, Daidone, Silvio, Devlin, Nancy, & Street, Andrew. 2013. Hospital Variation in Patient-Reported Outcomes at the Level of EQ-5D Dimensions: Evidence from England. Medical Decision Making.
Haas, Steven. 2008. Trajectories of functional health: the long arm of childhood health and socioeconomic factors. Social Science & Medicine, 66(4), 849–861.
Hernández Alava, Mónica, Wailoo, Allan J, & Ara, Roberta. 2012. Tails from the peak district: adjusted limited dependent variable mixture models of EQ-5D questionnaire health state utility values. Value in Health, 15(3), 550–561.
Jones, Andrew M, Rice, Nigel, & Roberts, Jennifer. 2010. Sick of work or too sick to work? Evidence on self-reported health shocks and early retirement from the BHPS. Economic Modelling, 27(4), 866–880.
Kleinman, Lawrence C, & Norton, Edward C. 2009. What’s the risk? A simple approach for estimating adjusted risk measures from nonlinear models including logistic regression. Health Services Research, 44(1), 288–302.
Landro, L. 2012. The Simple Idea That Is Transforming Health Care. The Wall Street Journal.
Loprest, Pamela, Rupp, Kalman, & Sandell, Steven H. 1995. Gender, disabilities, and employment in the health and retirement study. Journal of Human Resources, S293–S318.
Manca, A., Hawkins, N., & Sculpher, M.J. 2005. Estimating mean QALYs in trial-based cost-effectiveness analysis: the importance of controlling for baseline utility. Health Economics, 14(5), 487–496.
McCarthy, I. 2014. Putting the Patient in Patient Reported Outcomes: A Robust Methodology for Health Outcomes Assessment. Health Economics, 10.1002/hec.3113.
McGarry, Kathleen. 2004. Health and Retirement: Do Changes in Health Affect Retirement Expectations? Journal of Human Resources, 39(3), 624–648.
Mein, Gill, Martikainen, Pekka, Hemingway, Harry, Stansfeld, Stephen, & Marmot, Michael. 2003. Is retirement good or bad for mental and physical health functioning? Whitehall II longitudinal study of civil servants. Journal of Epidemiology and Community Health, 57(1), 46–49.
Mortimer, D., & Segal, L. 2008. Comparing the incomparable? A systematic review of competing techniques for converting descriptive measures of health status into QALY-weights. Medical Decision Making, 28(1), 66.
Oaxaca, Ronald. 1973. Male-female wage differentials in urban labor markets. International Economic Review, 14(3), 693–709.
Parkin, D., Rice, N., & Devlin, N. 2010. Statistical analysis of EQ-5D profiles: does the use of value sets bias inference? Medical Decision Making, 30(5), 556–565.
PCORI. 2012. Draft National Priorities for Research and Research Agenda: version 1.
Peppard, P., Kindig, D., Riemer, A., Dranger, E., & Remington, P. 2003. Wisconsin County Health Rankings. Tech. rept. Wisconsin Public Health and Health Policy Institute.
Peppard, Paul E, Kindig, David A, Dranger, Elizabeth, Jovaag, Amanda, & Remington, Patrick L. 2008. Ranking community health status to stimulate discussion of local public health issues: the Wisconsin County Health Rankings. American Journal of Public Health, 98(2), 209–212.
Porter, Michael E. 2010. What Is Value in Health Care? New England Journal of Medicine, 363(26), 2477–2481. PMID: 21142528.
Powell, J.L. 1984. Least absolute deviations estimation for the censored regression model. Journal of Econometrics, 25(3), 303–325.
Selby, J.V., Beal, A.C., & Frank, L. 2012. The Patient-Centered Outcomes Research Institute (PCORI) national priorities for research and initial research agenda. JAMA: The Journal of the American Medical Association, 307(15), 1583–1584.
Van Solinge, Hanna. 2007. Health Change in Retirement: A Longitudinal Study among Older Workers in the Netherlands. Research on Aging, 29(3), 225–256.

A Appendix A: Description of the SF-6D

The SF-6D is a six-dimensional health profile derived from a subset of responses from the SF-36 or SF-12 (Brazier et al., 2002; Brazier & Ratcliffe, 2007). The six dimensions of health classified by the SF-6D are: 1) physical functioning; 2) role limitations; 3) social functioning; 4) pain; 5) mental health; and 6) vitality. Each domain is characterized numerically with a range of integers, where a 1 indicates the best value in each domain. The worst value in each domain varies, with values up to 6 in the physical functioning and pain domains, values up to 5 in the social functioning, mental health, and vitality domains, and values up to 4 in the role limitations domain. The patient’s full SF-6D profile is therefore characterized by a series of six integers, with the best health state represented by {1, 1, 1, 1, 1, 1} and the worst health state represented by {6, 4, 5, 6, 5, 5}. Taking all possible combinations of responses, the SF-6D defines 18,000 unique health states.
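The count of 18,000 health states follows directly from the domain ranges (6 × 4 × 5 × 6 × 5 × 5). As a minimal sketch, the Python below enumerates every profile and, looking ahead to the scoring algorithm reproduced in Table 1, converts a profile into an index score. The decrements are transcribed from Table 1; the function and variable names are hypothetical, and applying the "Most Severe" decrement once whenever any listed condition holds is this sketch's reading of the table note.

```python
from itertools import product

# Worst level in each domain (order: PF, RL, SF, P, MH, V), from Appendix A.
WORST = {"PF": 6, "RL": 4, "SF": 5, "P": 6, "MH": 5, "V": 5}

# Decrements by domain and level, transcribed from Table 1 (level 1 has none).
DECREMENTS = {
    "PF": {2: 0.035, 3: 0.035, 4: 0.044, 5: 0.056, 6: 0.117},
    "RL": {2: 0.053, 3: 0.053, 4: 0.053},
    "SF": {2: 0.057, 3: 0.059, 4: 0.072, 5: 0.087},
    "P": {2: 0.042, 3: 0.042, 4: 0.065, 5: 0.102, 6: 0.171},
    "MH": {2: 0.042, 3: 0.042, 4: 0.100, 5: 0.118},
    "V": {2: 0.071, 3: 0.071, 4: 0.071, 5: 0.092},
}

# Lowest level in each domain that counts as "Most Severe" (Table 1 note).
MOST_SEVERE_FROM = {"PF": 4, "RL": 3, "SF": 4, "P": 5, "MH": 4, "V": 4}
MOST_SEVERE_DECREMENT = 0.061


def sf6d_index(profile):
    """Convert a profile such as {"PF": 1, "RL": 2, ...} into an index score."""
    score = 1.0  # starting value: perfect health
    for domain, level in profile.items():
        score -= DECREMENTS[domain].get(level, 0.0)
    # Assumption: the combination decrement applies once if any domain qualifies.
    if any(level >= MOST_SEVERE_FROM[d] for d, level in profile.items()):
        score -= MOST_SEVERE_DECREMENT
    return round(score, 3)


domains = list(WORST)
all_states = [
    dict(zip(domains, levels))
    for levels in product(*(range(1, WORST[d] + 1) for d in domains))
]
best = dict.fromkeys(domains, 1)

print(len(all_states))                      # 18000
print(sf6d_index(best), sf6d_index(WORST))  # 1.0 0.301
```

The worst profile scores 0.301 under this transcription, which matches the lower bound of 0.30 reported for the index.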
Each health state can then be converted into a single index score using available scoring algorithms that essentially assign weights to each domain and interactions between domains. Following the algorithm in Brazier & Ratcliffe (2007), the resulting SF-6D index score ranges from 0.30 to 1.0, with 0.30 representing the poorest health state, {6, 4, 5, 6, 5, 5}, and 1.0 representing the best health state, {1, 1, 1, 1, 1, 1}.⁷ The scoring algorithm from Brazier & Ratcliffe (2007) is reproduced in Table 1.

⁷ With one HRQoL assessment at one-year follow-up and no discounting, the patient’s SF-6D index score is equivalent to the patient’s QALY over the follow-up period. QALYs and summary scores are therefore sometimes treated synonymously in the empirical literature (e.g., Basu & Manca (2012), Gutacker et al. (2013), and others).

B Appendix B: Construction of HRS Dataset

The RAND HRS data from years 1994 through 2012 include 207,816 total observations and 37,319 individual respondents. Following Dave et al. (2008), I exclude any observations in which the respondent is below 50 or older than 75 years of age, reducing the total dataset to 151,856 observations and 31,809 respondents. I also exclude observations in which the mobility index or any underlying mobility measure is missing. This leaves 126,620 total observations and 29,317 respondents. Finally, I exclude all observations for the respondent if (at any time in the data) the individual is not in the labor force other than being fully retired, resulting in the final dataset of 55,105 observations and 14,999 individuals. In my analysis of the subset of healthy individuals prior to retirement, I further exclude individuals with “poor” self-reported health status prior to retirement or with any documented health problem (i.e., diabetes, heart disease, stroke, high blood pressure, arthritis, cancer, lung disease, or mental health problems).
For the cross-sectional analysis, I include observations in which individuals are partly retired, and the regressions include a dummy variable capturing whether this person is partly retired during the relevant HRS wave. These observations were excluded in the longitudinal analysis for two reasons: 1) more direct comparison with the existing literature; and 2) inherent difficulty interpreting the effect of full retirement when the individual reported being partly retired in the prior interview.

C Tables and Figures

Table 1: Scoring Algorithm for SF-6Dᵃ

Starting value = 1.0 (perfect health)

| Domain | Level | Decrement |
|---|---|---|
| Physical Functioning (PF) | PF=2 or PF=3 | -0.035 |
| | PF=4 | -0.044 |
| | PF=5 | -0.056 |
| | PF=6 | -0.117 |
| Role Limitations (RL) | RL=2 or RL=3 or RL=4 | -0.053 |
| Social Functioning (SF) | SF=2 | -0.057 |
| | SF=3 | -0.059 |
| | SF=4 | -0.072 |
| | SF=5 | -0.087 |
| Pain (P) | P=2 or P=3 | -0.042 |
| | P=4 | -0.065 |
| | P=5 | -0.102 |
| | P=6 | -0.171 |
| Mental Health (MH) | MH=2 or MH=3 | -0.042 |
| | MH=4 | -0.100 |
| | MH=5 | -0.118 |
| Vitality (V) | V=2 or V=3 or V=4 | -0.071 |
| | V=5 | -0.092 |
| Combination of Domains | “Most Severe” | -0.061 |

ᵃ Algorithm based on Brazier & Ratcliffe (2007). “Most Severe” denotes any one of the following responses: a level of 4 or more in the physical functioning, social functioning, mental health, or vitality domains; a level of 3 or more in the role limitation domain; or a level of 5 or more in the pain domain.

Figure 1: ATE Estimates with Binary Outcomesᵃ

[Figure: four panels plotting Deviation from True Effect (vertical axis) against Degree of Selection (horizontal axis, 0 to 50). Top row: OLS and 2SE with αd = 0.5, βd = 0.5, δd = 1 ∀d. Bottom row: OLS and 2SE with αd = 0.5, βd = 0.5 ∀d, and δ = [1, 1.5, 2, 0.5, 0].]

ᵃ Solid lines denote the average deviation from the true value, with dotted lines reflecting the 95% confidence bands.
Table 2: ATE Estimates with HRQoL Dataᵃ

| Model | Random Treatment Assignment: Treatment Effect | St. Dev. | RMSEᵇ | Selection on Observed Variables: Treatment Effect | St. Dev. | RMSE |
|---|---|---|---|---|---|---|
| DGP 1, δ = 1.5 × I_{6×1} | | | | | | |
| True Effect | 0.142 | 0.005 | | 0.142 | 0.005 | |
| 2SE | 0.143 | 0.006 | 0.054 | 0.143 | 0.007 | 0.054 |
| OLS | 0.143 | 0.007 | 0.066 | 0.151 | 0.010 | 0.068 |
| Beta MLE | 0.169 | 0.012 | 0.082 | 0.174 | 0.021 | 0.080 |
| Beta QMLE | 0.143 | 0.007 | 0.067 | 0.146 | 0.011 | 0.066 |
| DGP 2, δ = 3 × I_{6×1} | | | | | | |
| True Effect | 0.264 | 0.007 | | 0.264 | 0.007 | |
| 2SE | 0.264 | 0.007 | 0.046 | 0.263 | 0.009 | 0.046 |
| OLS | 0.265 | 0.008 | 0.077 | 0.284 | 0.010 | 0.091 |
| Beta MLE | 0.296 | 0.010 | 0.075 | 0.378 | 0.018 | 0.067 |
| Beta QMLE | 0.264 | 0.008 | 0.061 | 0.320 | 0.013 | 0.056 |
| DGP 3, δ = [2, 1, 0.5, 2.5, 0, 1]′ | | | | | | |
| True Effect | 0.104 | 0.004 | | 0.104 | 0.004 | |
| 2SE | 0.104 | 0.005 | 0.055 | 0.104 | 0.007 | 0.055 |
| OLS | 0.104 | 0.006 | 0.063 | 0.088 | 0.009 | 0.064 |
| Beta MLE | 0.117 | 0.012 | 0.087 | 0.083 | 0.023 | 0.083 |
| Beta QMLE | 0.104 | 0.006 | 0.070 | 0.079 | 0.011 | 0.070 |
| DGP 4, interaction terms with δ = 1.5 × I_{6×1} | | | | | | |
| True Effect | 0.122 | 0.006 | | 0.122 | 0.006 | |
| 2SE | 0.122 | 0.007 | 0.048 | 0.122 | 0.010 | 0.048 |
| OLS | 0.122 | 0.008 | 0.084 | 0.137 | 0.014 | 0.094 |
| Beta MLE | 0.133 | 0.011 | 0.096 | 0.234 | 0.023 | 0.085 |
| Beta QMLE | 0.122 | 0.008 | 0.074 | 0.165 | 0.015 | 0.073 |
| DGP 5, interaction terms with δ = 3 × I_{6×1} | | | | | | |
| True Effect | 0.220 | 0.007 | | 0.220 | 0.007 | |
| 2SE | 0.220 | 0.007 | 0.043 | 0.220 | 0.010 | 0.043 |
| OLS | 0.220 | 0.008 | 0.096 | 0.266 | 0.014 | 0.132 |
| Beta MLE | 0.231 | 0.011 | 0.081 | 0.332 | 0.022 | 0.080 |
| Beta QMLE | 0.220 | 0.008 | 0.068 | 0.272 | 0.015 | 0.065 |
| DGP 6, interaction terms with δ = [2, 1, 0.5, 2.5, 0, 1]′ | | | | | | |
| True Effect | 0.102 | 0.005 | | 0.102 | 0.005 | |
| 2SE | 0.102 | 0.006 | 0.047 | 0.102 | 0.009 | 0.047 |
| OLS | 0.102 | 0.007 | 0.078 | 0.098 | 0.013 | 0.079 |
| Beta MLE | 0.114 | 0.012 | 0.109 | 0.210 | 0.024 | 0.090 |
| Beta QMLE | 0.102 | 0.008 | 0.081 | 0.137 | 0.015 | 0.081 |

ᵃ Results based on 1,000 bootstrap iterations for N = 500 observations in each DGP.
ᵇ RMSE: Root mean squared error.

Table 3: Summary Statistics for HRS Dataᵃ

| Variable | Mean | Standard Deviation |
|---|---|---|
| Individual-level Variables | | |
| Female | 0.476 | 0.499 |
| Hispanic | 0.106 | 0.308 |
| White | 0.736 | 0.441 |
| Native Born | 0.880 | 0.325 |
| Educationᵇ | 12.780 | 3.062 |
| Mother’s Education | 9.862 | 3.687 |
| Father’s Education | 9.615 | 4.012 |
| Fully Retiredᶜ | 0.538 | 0.499 |
| People | 14,999 | |
| Panel-level Variables | | |
| Age | 61.961 | 6.977 |
| Mother’s Age | 75.614 | 13.570 |
| Father’s Age | 71.715 | 13.949 |
| Married | 0.699 | 0.459 |
| Health Insurance | 0.926 | 0.262 |
| Household Income ($1,000s) | 74.238 | 302.210 |
| Mobility Index | 0.752 | 1.238 |
| Difficulty Walking one block | 0.079 | 0.270 |
| Difficulty Walking several blocks | 0.188 | 0.391 |
| Difficulty Walking across room | 0.031 | 0.173 |
| Difficulty Climbing one flight of stairs | 0.106 | 0.308 |
| Difficulty Climbing several flights of stairs | 0.348 | 0.476 |
| Fully Retired | 0.430 | 0.495 |
| Observations | 55,105 | |

ᵃ To avoid differences in weights across panels, the statistics presented are the unweighted sample means and standard deviations. Sample is limited to individuals ages 50 to 75 who are either actively working in the labor force or fully retired. Individuals who are otherwise not in the labor force (e.g., partial retirement, disability, or unemployed) at any HRS wave are excluded.
ᵇ All education variables are measured in years.
ᶜ Reflects whether the individual was fully retired at any HRS wave.

Table 4: Retirement and Health, Random Effects Modelsᵃ

| Dependent Variable | Individual Mobility Measures: (1) | (2) | (3) | (4) | (5) | Mobility Index |
|---|---|---|---|---|---|---|
| Retired | 1.92*** (0.10) | 1.51*** (0.07) | 2.23*** (0.14) | 1.52*** (0.08) | 0.82*** (0.06) | 0.32*** (0.02) |
| Female | 0.25*** (0.10) | 0.48*** (0.08) | 0.18 (0.12) | 0.66*** (0.08) | 1.24*** (0.07) | 0.22*** (0.02) |
| Hispanic | -0.36* (0.21) | -0.20 (0.17) | -0.27 (0.26) | 0.13 (0.17) | 0.00 (0.14) | -0.04 (0.04) |
| White | -0.45*** (0.12) | -0.31*** (0.10) | -0.38*** (0.15) | -0.25** (0.10) | -0.24*** (0.09) | -0.12*** (0.03) |
| Education | -0.16*** (0.02) | -0.19*** (0.02) | -0.14*** (0.02) | -0.16*** (0.02) | -0.16*** (0.01) | -0.05*** (0.00) |
| Native Born | 0.89*** (0.20) | 1.06*** (0.16) | 0.57** (0.25) | 0.71*** (0.16) | 0.93*** (0.13) | 0.25*** (0.03) |
| Mother’s Education | -0.00 (0.02) | 0.00 (0.02) | -0.04 (0.02) | 0.00 (0.02) | -0.01 (0.01) | -0.00 (0.00) |
| Father’s Education | -0.03* (0.02) | -0.03** (0.01) | 0.01 (0.02) | -0.04*** (0.01) | -0.06*** (0.01) | -0.01*** (0.00) |
| Married | -0.22** (0.09) | -0.10 (0.07) | -0.50*** (0.12) | -0.28*** (0.08) | -0.04 (0.06) | -0.06*** (0.02) |
| Health Insurance | 0.27* (0.15) | 0.33*** (0.11) | 0.11 (0.19) | 0.01 (0.11) | 0.18** (0.08) | 0.07*** (0.02) |
| Household Income (log) | -0.28*** (0.04) | -0.32*** (0.03) | -0.34*** (0.05) | -0.30*** (0.03) | -0.19*** (0.03) | -0.06*** (0.01) |
| Age | 0.03*** (0.01) | 0.05*** (0.01) | -0.02* (0.01) | 0.02*** (0.01) | 0.05*** (0.00) | 0.01*** (0.00) |
| Mother’s Age | -0.01*** (0.00) | -0.01*** (0.00) | -0.01** (0.00) | -0.01*** (0.00) | -0.01*** (0.00) | -0.00*** (0.00) |
| Father’s Age | -0.01** (0.00) | -0.01** (0.00) | -0.01 (0.00) | -0.01** (0.00) | -0.01*** (0.00) | -0.00*** (0.00) |
| Constant | -2.79*** (0.69) | -1.52*** (0.55) | -0.64 (0.85) | -0.67 (0.56) | -0.66 (0.46) | 1.20*** (0.14) |

ATE on Mobility Indexᵇ: 0.267*** (0.009) from the individual mobility measures; 0.321*** (0.018) from the Mobility Index model.

ᵃ Each individual mobility measure regression equation is estimated using a random effects logit model. The overall mobility index is estimated using a linear random effects model. The sample is limited to individuals between 50 and 75 years of age who are either in the labor force or fully retired. Robust standard errors are reported in parentheses. All models include fixed effects for census region and HRS wave. * p<0.1. ** p<0.05. *** p<0.01.
ᵇ ATE: Average Treatment Effect. Bootstrap standard errors calculated for the 2SE based on 500 replications.

Figure 2: Cross-sectional Analysis of Retirement and Health

[Figure: two panels, Two-stage Estimator and OLS, each plotting estimates by HRS wave from 1996 to 2010.]

Table 5: Retirement and Health among Healthy Retirees, Random Effects Modelsᵃ

| Dependent Variable | Individual Mobility Measures: (1) | (2) | (3) | (4) | (5) | Mobility Index |
|---|---|---|---|---|---|---|
| Retired | 0.06 (0.36) | 0.45* (0.23) | 0.99** (0.50) | 0.26 (0.29) | 0.16 (0.18) | 0.06 (0.04) |
| Female | 0.74*** (0.21) | 0.73*** (0.14) | 0.57** (0.29) | 0.93*** (0.17) | 1.52*** (0.11) | 0.21*** (0.02) |
| Hispanic | -0.01 (0.38) | -0.00 (0.26) | -0.52 (0.58) | -0.02 (0.30) | 0.02 (0.20) | -0.01 (0.03) |
| White | -0.05 (0.24) | 0.03 (0.17) | 0.08 (0.35) | -0.14 (0.19) | -0.10 (0.13) | -0.02 (0.02) |
| Education | -0.09** (0.04) | -0.13*** (0.03) | -0.17*** (0.06) | -0.10*** (0.03) | -0.11*** (0.02) | -0.02*** (0.00) |
| Native Born | 0.64* (0.37) | 0.66*** (0.25) | 0.14 (0.51) | 0.48* (0.28) | 0.94*** (0.19) | 0.12*** (0.03) |
| Mother’s Education | -0.00 (0.04) | 0.02 (0.03) | -0.03 (0.06) | -0.02 (0.03) | -0.02 (0.02) | -0.00 (0.00) |
| Father’s Education | -0.02 (0.03) | -0.05** (0.02) | 0.01 (0.05) | -0.07*** (0.03) | -0.09*** (0.02) | -0.01*** (0.00) |
| Married | 0.13 (0.21) | 0.25* (0.14) | 0.01 (0.31) | 0.06 (0.16) | 0.11 (0.10) | 0.02 (0.02) |
| Health Insurance | 0.07 (0.25) | 0.18 (0.16) | 0.18 (0.40) | -0.23 (0.18) | 0.13 (0.12) | 0.02 (0.02) |
| Household Income (log) | -0.27*** (0.09) | -0.28*** (0.06) | -0.34** (0.14) | -0.27*** (0.07) | -0.22*** (0.05) | -0.04*** (0.01) |
| Age | 0.09*** (0.02) | 0.08*** (0.01) | 0.03 (0.03) | 0.07*** (0.01) | 0.07*** (0.01) | 0.01*** (0.00) |
| Mother’s Age | -0.02** (0.01) | -0.02*** (0.01) | 0.01 (0.01) | -0.02*** (0.01) | -0.01*** (0.00) | -0.00*** (0.00) |
| Father’s Age | -0.01 (0.01) | -0.00 (0.01) | -0.00 (0.01) | -0.01 (0.01) | -0.01** (0.00) | -0.00* (0.00) |
| Constant | -9.07*** (1.67) | -5.29*** (1.02) | -6.97*** (2.42) | -4.10*** (1.25) | -2.68*** (0.78) | 0.33** (0.15) |

ATE on Mobility Indexᵇ: 0.023* (0.012) from the individual mobility measures; 0.059 (0.184) from the Mobility Index model.

ᵃ Each individual mobility measure regression equation is estimated using a random effects logit model. The overall mobility index is estimated using a linear random effects model. The sample is limited to individuals between 50 and 75 years of age who are either in the labor force or fully retired and who had no health problems in the waves prior to retirement. Robust standard errors are reported in parentheses. All models include fixed effects for census region and HRS wave. * p<0.1. ** p<0.05. *** p<0.01.
ᵇ ATE: Average Treatment Effect. Bootstrap standard errors calculated for the 2SE based on 500 replications.

Figure 3: Cross-sectional Analysis of Retirement and Health among Healthy Retirees

[Figure: two panels, Two-stage Estimator and OLS, each plotting estimates by HRS wave from 1996 to 2010.]

Table 6: Summary Statistics for ISSG Data (N=362)

| Variable | Mean | Standard Deviation |
|---|---|---|
| Age | 56.76 | 14.51 |
| BMI | 26.59 | 5.84 |
| Baseline SF-6D | 0.61 | 0.12 |
| Follow-up SF-6D | 0.66 | 0.12 |

| Variable | Count | Percent |
|---|---|---|
| Operative | 193 | 53% |
| Female | 309 | 85% |

| Domain and Level | Baseline Count | Baseline Percent | Follow-up Count | Follow-up Percent |
|---|---|---|---|---|
| Physical Functioning Domain | | | | |
| PF=1 | 0 | 0% | 0 | 0% |
| PF=2 | 35 | 10% | 54 | 15% |
| PF=3 | 117 | 32% | 121 | 33% |
| PF=4 | 96 | 27% | 83 | 23% |
| PF=5 | 100 | 28% | 95 | 26% |
| PF=6 | 14 | 4% | 9 | 2% |
| Role Limitations Domain | | | | |
| RL=1 | 41 | 11% | 53 | 15% |
| RL=2 | 115 | 32% | 144 | 40% |
| RL=3 | 10 | 3% | 11 | 3% |
| RL=4 | 196 | 54% | 154 | 42% |
| Social Functioning Domain | | | | |
| SF=1 | 110 | 30% | 156 | 43% |
| SF=2 | 72 | 20% | 77 | 21% |
| SF=3 | 99 | 27% | 86 | 24% |
| SF=4 | 56 | 15% | 30 | 8% |
| SF=5 | 25 | 7% | 13 | 4% |
| Pain Domain | | | | |
| P=1 | 5 | 1% | 19 | 5% |
| P=2 | 34 | 9% | 47 | 13% |
| P=3 | 79 | 22% | 123 | 34% |
| P=4 | 85 | 23% | 88 | 24% |
| P=5 | 109 | 30% | 66 | 18% |
| P=6 | 50 | 14% | 19 | 5% |
| Mental Health Domain | | | | |
| MH=1 | 76 | 21% | 130 | 36% |
| MH=2 | 127 | 35% | 132 | 36% |
| MH=3 | 89 | 25% | 61 | 17% |
| MH=4 | 53 | 15% | 32 | 9% |
| MH=5 | 17 | 5% | 7 | 2% |
| Vitality Domain | | | | |
| V=1 | 13 | 4% | 15 | 4% |
| V=2 | 73 | 20% | 123 | 34% |
| V=3 | 107 | 30% | 108 | 30% |
| V=4 | 94 | 26% | 74 | 20% |
| V=5 | 75 | 21% | 42 | 12% |

Table 7: Regression Resultsᵃ

| Variable | OLS: QALY | Beta MLE: QALY | Beta QMLE: QALY | Ordered Probit: PF | RL | SF | P | MH | V |
|---|---|---|---|---|---|---|---|---|---|
| Surgery | 0.03*** (0.01) | 0.17*** (0.05) | 0.15*** (0.05) | -0.06 (0.12) | -0.06 (0.12) | 0.14 (0.12) | 0.54*** (0.12) | 0.28** (0.12) | 0.26** (0.12) |
| Age | 0.00* (0.00) | 0.00 (0.00) | 0.00* (0.00) | -0.00 (0.00) | -0.01 (0.00) | 0.00 (0.00) | 0.01** (0.00) | 0.00 (0.00) | 0.00 (0.00) |
| Female | -0.02 (0.01) | -0.10 (0.07) | -0.09 (0.07) | -0.12 (0.16) | -0.27 (0.17) | 0.07 (0.17) | -0.09 (0.16) | -0.47*** (0.18) | -0.31* (0.17) |
| BMI | -0.00 (0.00) | -0.00 (0.00) | -0.00 (0.00) | -0.00 (0.01) | -0.02** (0.01) | 0.01 (0.01) | -0.02 (0.01) | -0.00 (0.01) | 0.00 (0.01) |
| Baseline HRQoL: SF-6D Index | 0.58*** (0.05) | 2.61*** (0.25) | 2.63*** (0.24) | | | | | | |
| Baseline HRQoL: PF | | | | 0.56*** (0.07) | | | | | |
| Baseline HRQoL: RL | | | | | 0.41*** (0.06) | | | | |
| Baseline HRQoL: SF | | | | | | 0.51*** (0.05) | | | |
| Baseline HRQoL: P | | | | | | | 0.44*** (0.05) | | |
| Baseline HRQoL: MH | | | | | | | | 0.62*** (0.06) | |
| Baseline HRQoL: V | | | | | | | | | 0.59*** (0.06) |

| Estimator | ATEᵇ | RMSEᶜ |
|---|---|---|
| OLS | 0.033*** (0.011) | 0.098 |
| Beta MLE | 0.038*** (0.011) | 0.111 |
| Beta QMLE | 0.032*** (0.011) | 0.098 |
| 2SE (Ordered Probit) | 0.029*** (0.010) | 0.097 |

ᵃ Results based on OLS, Beta MLE, Beta QMLE, and Ordered Probit regressions. Beta MLE and QMLE estimation follows the procedure and code available from Basu & Manca (2012). Standard errors in parentheses. * p<0.1. ** p<0.05. *** p<0.01.
ᵇ ATE: Average Treatment Effect.
ᶜ RMSE: root mean squared error.