Murphy

Supplement to Methodological Challenges in Constructing Effective Treatment Sequences for Chronic Psychiatric Disorders

Susan A. Murphy¹, University of Michigan; David W. Oslin, University of Pennsylvania; A. John Rush, University of Texas, Southwestern Medical Center; Ji Zhu, University of Michigan; for MCATS²

Keywords: Clinical decision making, methodology, clinical trials, statistics, treatment, design

¹ Corresponding author: SA Murphy, Institute for Social Research, 2068, Ann Arbor, MI 48106-1248; phone: 734-763-5046; fax: 734-763-4676; email: samurphy@umich.edu

² Members of the MCATS network (alphabetical order) are Satinder Baveja (University of Michigan), Linda Collins (Pennsylvania State University), Marie Davidian (North Carolina State University), Kevin Lynch (University of Pennsylvania), James McKay (University of Pennsylvania), Joelle Pineau (McGill University), Daniel Rivera (Arizona State University), Eric Rosenberg (Harvard Medical School), Thomas TenHave (University of Pennsylvania), and Anastasios Tsiatis (North Carolina State University).

Q-Learning: an analytic method for constructing decision rules

To set the stage, consider the familiar situation of one treatment decision; in this case standard methods can be used to adapt the treatment to the individual characteristics. The case of multiple decisions is subsequently discussed. Suppose two treatments are available; the decision as to which treatment is best may be based on tailoring variables (here pretreatment observations). Suppose the variables $(X, T, Y)$ for each subject are recorded; $X$ denotes the subject's $p$ pretreatment variables, $T$ is an indicator variable that denotes the subject's assigned treatment, and $Y$ denotes the subject's response (e.g., a score on a symptom rating scale or results of a biological assay). In order to use data to construct decision rules, a data analysis model that relates response to the pretreatment variables is employed.
A particularly simple but useful model is

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + T(\alpha_0 + \alpha_1 Z_1 + \cdots + \alpha_q Z_q) + \varepsilon \qquad (1)$$

where the $Z_j$'s are selected $X$ variables or summaries of the selected $X$ variables (i.e., the $Z_j$'s are potential tailoring variables), and $\varepsilon$ is the error term. The coefficients, that is, the $\beta$'s and $\alpha$'s, might be estimated from data using regression analysis. Suppose a small value of $Y$ corresponds to a good response; then the decision rule is determined by (1) as follows. The first step in constructing the decision rule is to minimize

$$T(\alpha_0 + \alpha_1 Z_1 + \cdots + \alpha_q Z_q) \qquad (2)$$

in $T$. Performing this algebraic minimization yields the decision rule: "Given a patient with tailoring variables $(Z_1, \ldots, Z_q)$, choose treatment 1 if the sum $\alpha_0 + \alpha_1 Z_1 + \cdots + \alpha_q Z_q < 0$ and choose treatment 0 otherwise."

Now consider the above approach for sequential decisions (e.g., concerning either or both the timing of treatment alterations and the sequencing of these alterations). In this case, the treatment indicator $T$ is time varying (since different treatments are considered over time) and most likely so is the response $Y$. A natural approach to informing the construction of the decision rules is to implement a series of models as in (1), with each model corresponding to a decision regarding when to alter treatment or which treatment should be next. That is, conduct separate regressions, one per decision. In these cases, the $X$ variables and associated $Z$ summaries would include outcomes observed during the prior treatments (e.g., response level, adherence, side effects and so on) in addition to pretreatment variables. However, a series of models similar to (1) does not address the long-term benefit of each decision. This is because the response $Y$ in each model represents only the short-term effect of the present decision instead of a longer-term effect. Simply replacing $Y$ by the response measured at the end of a longer duration is insufficient. Why?
Because future decisions influence the long-term impact of the present decision; thus the impacts of future decisions must also be incorporated.

To illustrate Q-Learning, suppose the goal is to minimize the average level of depression over a 4-month period, and suppose that data from the SMART design in Figure 1 are available. Note there are only two key decisions in this rather simple trial: the initial treatment decision and then the second treatment decision (for those not responding satisfactorily to the initial treatment). In Q-Learning with SMART data the construction of the decision rules works backwards from the last decision to the first decision. Since there are two treatment decisions there are two regressions. Consider the last (here second) treatment decision and subjects whose depression did not remit. A simple model would be similar to (1); with $Y_2$ the last depression score,

$$Y_2 = \beta_{20} + \beta_{21} X_{21} + \cdots + \beta_{2p} X_{2p} + T_2(\alpha_{20} + \alpha_{21} Z_{21} + \cdots + \alpha_{2q} Z_{2q}) + \varepsilon_2. \qquad (3)$$

The subscript 2 indicates that the variables, the $X$'s and $Z$'s, can be observed at any time up to the assignment of the second treatment. These variables might include observations of response during the initial treatment, presence of side effects, and patient characteristics (such as number of past episodes), and would likely include the initial treatment to which the subject was assigned. The treatment $T_2$ is coded as 1 if the switch in treatment is assigned and is coded as 0 otherwise. As before, the $\beta$'s and $\alpha$'s can be estimated from SMART data using regression analysis. The decision rule is constructed in a similar fashion to the construction given at the beginning of this subsection. That is, given a patient with tailoring variables $(Z_{21}, \ldots, Z_{2q})$, switch treatment if the sum $\alpha_{20} + \alpha_{21} Z_{21} + \cdots + \alpha_{2q} Z_{2q} < 0$.

Now consider the initial decision.
As discussed above, it is insufficient to use the proximal response $Y_1$ to the first decision in a model such as (1); $Y_1$ only represents short-term benefits instead of both short- and long-term benefits. Instead a term is added to $Y_1$; this term represents longer-term benefits of the initial decision. Denote this additional term by $V$; model (3) provides $V$. In fact, if a subject's depression did not remit by 2 months then

$$V = \beta_{20} + \beta_{21} X_{21} + \cdots + \beta_{2p} X_{2p} + (\alpha_{20} + \alpha_{21} Z_{21} + \cdots + \alpha_{2q} Z_{2q})$$

if $\alpha_{20} + \alpha_{21} Z_{21} + \cdots + \alpha_{2q} Z_{2q} < 0$ (treatment $T_2 = 1$ is best) and

$$V = \beta_{20} + \beta_{21} X_{21} + \cdots + \beta_{2p} X_{2p}$$

otherwise (treatment $T_2 = 0$ is best). Thus $V$ represents the effect of the initial decision on both the depression score collected at the end of the 4-month period ($Y_2$) and on the best second treatment decision. For subjects whose depression remits by 2 months, set $V$ to the predicted value of $Y_2$ from a regression of $Y_2$ on $X_{21}, \ldots, X_{2p}$ for the remitting subjects. To construct the initial decision rule, use the data from all subjects and the model

$$Y_1 + V = \beta_{10} + \beta_{11} X_{11} + \cdots + \beta_{1p} X_{1p} + T_1(\alpha_{10} + \alpha_{11} Z_{11} + \cdots + \alpha_{1q} Z_{1q}) + \varepsilon_1.$$

The subscript 1 indicates that the variables, the $X$'s and $Z$'s, can be observed at any time up to the assignment of the initial treatment. That is, the potential tailoring variables (the $Z$'s) are pretreatment variables. The treatment $T_1$ is coded as 1 if medication A is assigned and is coded as 0 otherwise. Note the addition of $V$ to $Y_1$ ($Y_1$ is the depression score collected at 2 months) accounts for the effect of the initial treatment $T_1$ on both $Y_2$ and on the choice of the best second treatment. As before, the $\beta$'s and $\alpha$'s can be estimated from SMART data using regression analysis. The constructed initial decision rule is: given a patient with tailoring variables $(Z_{11}, \ldots, Z_{1q})$, provide medication A if $\alpha_{10} + \alpha_{11} Z_{11} + \cdots + \alpha_{1q} Z_{1q} < 0$.
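The backward construction described above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions, not the authors' software: each stage uses a single covariate that also serves as the tailoring variable, the data are simulated rather than taken from any trial, and all function names, the remission cut-off, and the simulated effect sizes are hypothetical. The `alpha` values in the code are the estimated coefficients that multiply the treatment indicator at each decision.

```python
import numpy as np


def fit_ols(design, y):
    """Least-squares coefficients; `design` already contains an intercept column."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def q_learning_two_stage(x1, t1, y1, remit, x2, t2, y2):
    """Backward fit for two decisions; smaller scores are better.

    Assumption: one covariate per stage, which also serves as the tailoring
    variable (X = Z). `remit` is True for subjects with no second decision.
    """
    non = ~remit
    ones = np.ones_like(x2)

    # Second decision, non-remitters: Y2 ~ 1 + X2 + T2 * (1 + Z2).
    d2 = np.column_stack([ones, x2, t2, t2 * x2])
    b20, b21, a20, a21 = fit_ols(d2[non], y2[non])

    # Remitters: regress Y2 on X2 only.
    r0, r1 = fit_ols(np.column_stack([ones, x2])[remit], y2[remit])

    # V = predicted Y2 under the best second decision (switch only if contrast < 0).
    contrast2 = a20 + a21 * x2
    v = np.where(remit, r0 + r1 * x2, b20 + b21 * x2 + np.minimum(contrast2, 0.0))

    # Initial decision, all subjects: Y1 + V ~ 1 + X1 + T1 * (1 + Z1).
    d1 = np.column_stack([np.ones_like(x1), x1, t1, t1 * x1])
    _, _, a10, a11 = fit_ols(d1, y1 + v)
    return {"alpha2": (a20, a21), "alpha1": (a10, a11), "v": v}


def choose_treatment(alpha, z):
    """Return 1 if the estimated contrast alpha0 + alpha1 * z is negative, else 0."""
    return int(alpha[0] + alpha[1] * z < 0)


def simulate_smart(n=4000, seed=0):
    """Simulated (not real) SMART data; true second-stage contrast is 1 - 2 * X2."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    t1 = rng.integers(0, 2, size=n).astype(float)
    y1 = 10 + x1 + t1 * (-1 + 2 * x1) + rng.normal(size=n)
    remit = y1 < 9                      # hypothetical remission cut-off
    x2 = y1 - 10                        # centred 2-month score
    t2 = np.where(remit, 0.0, rng.integers(0, 2, size=n))
    y2 = np.where(remit, 4 + 0.5 * x2, 8 + 0.5 * x2 + t2 * (1 - 2 * x2))
    y2 = y2 + rng.normal(size=n)
    return x1, t1, y1, remit, x2, t2, y2


fit = q_learning_two_stage(*simulate_smart())
print("second-decision contrast (a20, a21):", fit["alpha2"])
print("switch for a non-remitter with X2 = 2?", choose_treatment(fit["alpha2"], 2.0))
```

The key step is the `np.minimum(contrast2, 0.0)` term: it adds the treatment contrast to the predicted score only when switching is estimated to lower the final score, which is how the best second decision is folded into the outcome used for the initial regression.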
Needed Research and Collaboration:

Most adaptive treatment strategies are multi-component treatments; the number of potential components may be large (e.g., medications, psychosocial therapies, adherence enhancement efforts, early versus late timing in changing treatment). Moreover, delivery mechanisms can vary (e.g., group therapy versus individual therapy, telephone versus in-person counseling, etc.). A challenge is to generalize SMART designs so that scientists can sift through many potential components, eliminating those that are less effective. The generalized SMART design should minimize expense and logistical difficulties by implementing a minimal number of experimental conditions (groups of subjects each assigned a different adaptive treatment strategy).

To generalize SMART designs, one might consider experimental designs from agriculture or engineering that test many components when the number of experimental conditions in the trial cannot be large (Box et al., 1978; Collins et al., 2005). These designs (called balanced fractional factorial designs) minimize the number of experimental conditions in a trial, yet they preserve power and permit testing of a variety of components. The number of experimental conditions is minimized by careful use of clinical experience, results from past studies, and theory. To test the efficacy of any one component, one averages over responses from multiple experimental conditions. In these designs, the relevant sample size is not the number of subjects per experimental condition but, for example, in the case of main effects, is one half the total sample size. Balanced fractional factorial designs are already in use in the behavioral sciences (see http://healthmedia.umich.edu/projects/project.php?id=35 and Collins et al., 2005). These promising designs have been primarily developed to construct multi-component treatments that are not adaptive to patient outcomes.
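To make the sample-size remark concrete, the following sketch builds one common balanced fractional factorial, a half fraction of a four-component design, and checks its balance. It is an illustration under stated assumptions rather than a design taken from the cited studies: the choice of four components, the generator (the fourth component's level is the product of the other three), the function names, and the cell means are all hypothetical.

```python
import itertools
import numpy as np


def half_fraction(k):
    """2^(k-1) design in -1/+1 coding: the last factor is the product of the others."""
    base = np.array(list(itertools.product([-1, 1], repeat=k - 1)))
    last = base.prod(axis=1, keepdims=True)   # defining relation, e.g. D = ABC
    return np.hstack([base, last])


def main_effect(design, cell_means, factor):
    """Mean response with the component on minus mean response with it off."""
    on = design[:, factor] == 1
    return cell_means[on].mean() - cell_means[~on].mean()


design = half_fraction(4)                     # 8 conditions instead of 16

# Balance: every component is "on" in exactly half of the conditions ...
on_counts = (design == 1).sum(axis=0)
# ... and the main-effect columns are mutually orthogonal.
gram = design.T @ design

# Hypothetical cell means: component 0 lowers the score by 2, component 2 raises it by 1.
cell_means = 10.0 - 2.0 * (design[:, 0] == 1) + 1.0 * (design[:, 2] == 1)
effect_of_component_0 = main_effect(design, cell_means, 0)

print(design)
print("conditions with each component on:", on_counts)
print("main effect of component 0:", effect_of_component_0)
```

Because each component is on in half of the conditions, the comparison for any main effect pools half of all subjects on each side, regardless of how few subjects are in any single condition.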
To generalize this approach for use in constructing adaptive treatment strategies, balanced fractional factorial designs are needed that permit the evaluation of components that are assigned only when subjects qualify on the basis of their outcomes (e.g., when the disorder remits or when the disorder relapses).

Tailoring Variables

Because of the large number of potential tailoring variables, summaries are needed. Even though theory or clinical experience might suggest natural summaries of potential tailoring variables, data-driven summaries might also be useful. Once formulated, data-derived, theory-based, and clinically based summaries can all be used to construct decision rules (e.g., using Q-Learning or a similar methodology).

Consider the case of choosing one of two treatments for each patient based on pretreatment observations (potential tailoring variables). The construction of a number of summaries of the tailoring variables is called feature construction. Ideally the number of summaries is much smaller than the total number of variables (preferably one summary is sufficient). For ease of interpretation, the summaries might be weighted combinations of the original variables. In forming the summaries, variables that are not helpful in decision making should be eliminated. Ideally, many of the weights will be zero, so the summary contains only the most important variables (variable selection). Variable selection is very important considering the expense and time needed to collect tailoring variables in both experimental and clinical settings.

Principal component analysis (PCA) (Dunteman, 1989; Jolliffe, 2002), widely used to construct summaries, seeks linear combinations of the original tailoring variables such that the derived components account for maximal variance in the data.
The first principal component is the summary capturing the most variance; the second principal component captures the most of the remaining variance, and so on. Each principal component is a linear combination of all of the original (pretreatment) variables (non-zero weights), so there is no variable selection.

Jolliffe et al. (2003) introduced SCoTLASS (see Appendix B) to obtain modified principal components with zero weights. SCoTLASS works like PCA in that it finds the weights that maximize the variance, but it forces the weights for the less important tailoring variables to be set at zero (i.e., these variables are removed from the summary). This method is useful when one wishes to retain those tailoring variables that are responsible for most of the variance, and to remove those variables that contribute little to the variance.

A third method only produces summaries strongly related to the response (this is not necessarily the case with PCA and SCoTLASS). This method is called Partial Least Squares (PLS) (see Haenlein and Kaplan, 2004 and Appendix B). PLS is a popular technique that uses the response variable to construct linear combinations of the tailoring variables. Specifically, PLS seeks weighted combinations of the tailoring variables that account for the variance and have high correlation with the response variable. Baer et al. (2005) provide an example of its use in the context of longitudinal studies.

Consider the case of choosing one of two treatments for each patient based on pretreatment observations. Suppose the data consist of $(x, t, r)$ on each subject, where $x$ is a collection of $p$ potential tailoring (pretreatment) variables, $t$ is an indicator variable denoting assigned treatment, and $r$ denotes the response. Consider the construction of a number of summaries of the tailoring variables, say $(z_1, \ldots, z_q)$.
For ease of interpretation, each summary $z_j$, $j = 1, \ldots, q$, might be a weighted combination of the original variables:

$$z_j = c_{j1} x_1 + \cdots + c_{jp} x_p \qquad (2)$$

where the $c_{ji}$, $i = 1, \ldots, p$, are weights. In forming the summaries, $x$'s that are not helpful in decision making should be eliminated.

SCoTLASS can be used to obtain modified summaries with zero weights. Here is how SCoTLASS works when only the first summary (the $z_1$) is derived. SCoTLASS works like PCA in that it finds the weights (the $c_{1j}$'s) that maximize the variance,

$$\mathrm{Var}\Big(\sum_{j=1}^{p} c_{1j} x_j\Big)$$

subject to the constraint $\sum_{j=1}^{p} c_{1j}^2 = 1$, but it achieves sparsity in the weights by adding an additional constraint on the weights:

$$\sum_{j=1}^{p} |c_{1j}| \le u$$

where $u$ is specified by the user. For a sufficiently small value of $u$, some of the $c_{1j}$'s will be set exactly equal to zero; hence the corresponding tailoring variables are removed from the summary.

As discussed in the text, PLS is a popular technique that uses the response variable to construct summaries. Specifically, PLS seeks weighted combinations that account for the variance in the $x$'s and have high correlation with the response variable. In particular, the first PLS component finds the weights (the $c_{1j}$'s) that maximize the product of the squared correlation and variance:

$$\mathrm{Corr}^2\Big(y, \sum_{j=1}^{p} c_{1j} x_j\Big)\,\mathrm{Var}\Big(\sum_{j=1}^{p} c_{1j} x_j\Big)$$

subject to the constraint that $\sum_{j=1}^{p} c_{1j}^2 = 1$, where $y$ denotes the response (written $r$ above). Each PLS summary is a linear combination of all of the original (pretreatment) variables (non-zero weights), so there is no variable selection.

Needed Research and Collaboration:

Much of the current work in feature construction and variable selection has focused on prediction rather than decision making. In the identification and formation of potential tailoring variables, ease and cost of collection (i.e., feasibility), reliability, interpretability (by both clinicians and patients), and discriminatory power must be balanced.
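As a brief aside on the PCA and PLS criteria stated above, the first set of weights under each criterion has a closed form, which the sketch below computes with NumPy. It is a minimal illustration on simulated data, with hypothetical function names, not a full PCA or PLS implementation. It relies on the identity Corr² × Var = Cov(y, z)² / Var(y), so the unit-norm PLS weights are proportional to the covariances between each variable and the response. SCoTLASS is not implemented here because its constrained maximization has no closed form; the sketch only reports the sum of absolute weights that its additional constraint would bound.

```python
import numpy as np


def first_pca_weights(x):
    """Unit-norm weights maximizing Var(sum_j c_j x_j): top eigenvector of Cov(x)."""
    cov = np.cov(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    return eigvecs[:, -1]


def first_pls_weights(x, y):
    """Unit-norm weights maximizing Corr(y, z)^2 * Var(z) for z = sum_j c_j x_j.

    Since Corr^2 * Var = Cov(y, z)^2 / Var(y), the maximizer is proportional
    to the vector of covariances between each x_j and y.
    """
    xc = x - x.mean(axis=0)
    yc = y - y.mean()
    w = xc.T @ yc
    return w / np.linalg.norm(w)


def pls_criterion(x, y, c):
    """Squared correlation with the response times variance of the summary."""
    z = x @ c
    return np.corrcoef(y, z)[0, 1] ** 2 * z.var(ddof=1)


# Simulated data (an assumption for illustration): variable 0 has the largest
# variance but is unrelated to the response; variable 1 drives the response.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 3)) * np.array([5.0, 1.0, 1.0])
y = 3.0 * x[:, 1] + rng.normal(size=n)

pca_w = first_pca_weights(x)
pls_w = first_pls_weights(x, y)
l1_of_pca = np.abs(pca_w).sum()   # the quantity SCoTLASS would bound by u

print("PCA weights:", np.round(pca_w, 3))
print("PLS weights:", np.round(pls_w, 3))
print("sum of |PCA weights|:", round(float(l1_of_pca), 3))
```

In this simulated setting the PCA summary loads mainly on the high-variance variable while the PLS summary loads mainly on the variable related to the response, which mirrors the distinction drawn in the text. Neither set of weights contains exact zeros.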
Those variables that are most time consuming and expensive to collect should be highly informative for decision making. Thus, an important challenge for collaborative teams is to evaluate both the clinical and cost utility of the above methods (PCA, SCoTLASS and PLS) for decision making rather than simple prediction. These evaluations are particularly needed for sequential clinical decision making, where there appears to have been little practical application of these approaches.

A second challenge is to develop methods that will produce summaries targeted at decision making, rather than response prediction. Generalizations of the above-mentioned techniques (e.g., PCA, SCoTLASS and PLS) may be fruitful. These summaries should interact with treatment, rather than simply predict response. Variables that interact with treatment decisions are known as moderators (Kramer et al., 2001, 2002) or effect modifiers (Rothman & Greenland, 1998). Additionally, the interaction should be sufficiently strong so that for patients with certain values of the pre- or during-treatment variable one of the treatment options or modifications is best, while for patients with other values of that variable another option or modification is best. These are called qualitative interactions (Peto, 1982; Byar, 1985). Another way to express this is that a tailoring variable might be of little use, even if it is a powerful predictor of response, if it does not predict differential response to treatment.

A third challenge is to employ variable selection in creating summaries when the goal is decision making. Present variable selection methods commonly used in engineering and statistics, such as subset selection, linear shrinkage methods (Breiman 1995, Tibshirani 1996), Bayesian model averaging (Hoeting, Madigan, Raftery & Volinsky 1999) and WINNOW (Littlestone 1988), have been developed only for the purpose of prediction.
Lastly, these methods need to be generalized for use in sequential decision making so as to predict differential response to treatment.

Supplemental References

Box GEP, Hunter WG, Hunter JS (1978). Statistics for experimenters: An introduction to design, data analysis, and model building. Wiley: New York.

Breiman L (1995). Better subset regression using the nonnegative garrote. Technometrics 37:373-384.

Byar DP (1985). Assessing apparent treatment-covariate interactions in randomized clinical trials. Stat Med 4:255-263.

Collins LM, Murphy SA, Nair V, Strecher V (2005). A strategy for optimizing and evaluating behavioral interventions. Ann of Behav Med 30:65-73.

Dunteman G (1989). Principal Components Analysis (Quantitative Applications in the Social Sciences). SAGE: Newbury Park, CA.

Haenlein M, Kaplan AM (2004). A beginner's guide to partial least squares analysis. Understanding Statistics 3(4):283-297.

Hoeting JA, Madigan D, Raftery AE, Volinsky CT (1999). Bayesian model averaging: a tutorial (Pkg: p382-417). Statistical Science 14(4):382-401.

Jolliffe IT (2002). Principal Component Analysis, Second Edition. Springer: New York.

Jolliffe IT, Trendafilov NT, Uddin M (2003). A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics 12:531-547.

Littlestone N (1988). Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning 2:285-318.

Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, Methodological 58:267-288.