Supplement to
Methodological Challenges in Constructing Effective Treatment Sequences for Chronic Psychiatric Disorders

Susan A. Murphy,1 University of Michigan; David W. Oslin, University of Pennsylvania; A. John Rush, University of Texas Southwestern Medical Center; Ji Zhu, University of Michigan; for MCATS2
Keywords: Clinical decision making, methodology, clinical trials, statistics,
treatment, design
1 Corresponding author: SA Murphy, Institute for Social Research, 2068, Ann Arbor, MI 48106-1248; phone: 734-763-5046; fax: 734-763-4676; email: samurphy@umich.edu
2 Members of the MCATS network (alphabetical order) are Satinder Baveja (University of Michigan), Linda Collins (Pennsylvania State University), Marie Davidian (North Carolina State University), Kevin Lynch (University of Pennsylvania), James McKay (University of Pennsylvania), Joelle Pineau (McGill
University), Daniel Rivera (Arizona State University), Eric Rosenberg (Harvard Medical School), Thomas
TenHave (University of Pennsylvania), and Anastasios Tsiatis (North Carolina State University).
Q-Learning: an analytic method for constructing decision rules
To set the stage, consider the familiar situation of a single treatment decision; in this case, standard methods can be used to adapt the treatment to the individual's characteristics. The case of multiple decisions is discussed subsequently. Suppose two treatments are available; the decision as to which treatment is best may be based on tailoring variables (here, pretreatment observations). Suppose the variables (X, T, Y) are recorded for each subject: X denotes the subject's p pretreatment variables, T is an indicator variable denoting the subject's assigned treatment, and Y denotes the subject's response (e.g., a score on a symptom rating scale or the result of a biological assay). In
order to use data to construct decision rules, a data analysis model that relates response to the pretreatment variables is employed. A particularly simple but useful model is

Y = β0 + β1X1 + ... + βpXp + T(γ0 + γ1Z1 + ... + γqZq) + ε,    (1)

where the Zj's are selected X variables or summaries of the selected X variables (i.e., the Zj's are potential tailoring variables) and ε is the error term. The coefficients, that is, the β's and γ's, might be estimated from data using regression analysis. Suppose a small value of Y corresponds to a good response; then the decision rule is determined by (1) as follows. The first step in constructing the decision rule is to minimize
T(γ0 + γ1Z1 + ... + γqZq)    (2)

in T. Performing this algebraic minimization yields the decision rule:

"Given a patient with tailoring variables (Z1, ..., Zq), choose treatment 1 if the sum γ0 + γ1Z1 + ... + γqZq < 0, and choose treatment 0 otherwise."
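To make the construction concrete, the single-decision case can be sketched as follows. The sketch is illustrative only: the data are simulated, a single pretreatment variable X serves as the one tailoring variable Z1, and all coefficient values are invented. Model (1) is fit by ordinary least squares and the sign rule above is then applied.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated (invented) data: one pretreatment variable X, randomized
# treatment T, and response Y, where a smaller Y is a better response.
X = rng.normal(size=n)
T = rng.integers(0, 2, size=n)
# True generating model: treatment 1 helps (lowers Y) when X is large.
Y = 1.0 + 0.5 * X + T * (0.8 - 1.5 * X) + rng.normal(scale=0.5, size=n)

# Fit model (1) by ordinary least squares with Z1 = X:
#   Y = b0 + b1*X + T*(g0 + g1*Z1) + error
D = np.column_stack([np.ones(n), X, T, T * X])
(b0, b1, g0, g1), *_ = np.linalg.lstsq(D, Y, rcond=None)

def decision_rule(z1):
    """Choose treatment 1 if g0 + g1*z1 < 0; otherwise choose treatment 0."""
    return 1 if g0 + g1 * z1 < 0 else 0
```

For a patient with a large value of the tailoring variable, the estimated rule selects treatment 1, mirroring the algebraic minimization of (2).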
Now consider the above approach for sequential decisions (e.g.,
concerning either or both the timing of treatment alterations and the sequencing
of these alterations). In this case, the treatment indicator T is time-varying (since different treatments are considered over time), and most likely so is the response Y. A natural approach to informing the construction of the decision
rules is to implement a series of models as in (1), with each model corresponding
to a decision regarding when to alter treatment or which treatment should come next. That is, conduct separate regressions, one per decision. In these cases, the
X variables and associated Z summaries would include outcomes observed
during the prior treatments (e.g., response level, adherence, side effects and so
on) in addition to pretreatment variables. However, a series of models similar to
(1) does not address the long-term benefit of each decision. This is because the response Y in each model represents only the short-term effect of the present decision rather than the longer-term effect. Simply replacing Y by the response measured at the end of a longer duration is insufficient. Why? Because future
decisions influence the long-term impact of the present decision; thus the
impacts of future decisions must also be incorporated.
To illustrate Q-Learning, suppose the goal is to minimize the average level of depression over a 4-month period, and suppose that data from the SMART design in Figure 1 are available. Note there are only two key decisions in this rather simple trial: the initial treatment decision and then the second treatment decision (for those not responding satisfactorily to the initial treatment).
In Q-Learning with SMART data, the construction of the decision rules works backwards from the last decision to the first. Since there are two treatment decisions, there are two regressions. Consider the last (here, the second) treatment decision and subjects whose depression did not remit. A simple model would be similar to (1); with Y2 denoting the last depression score,

Y2 = β20 + β21X21 + ... + β2pX2p + T2(γ20 + γ21Z21 + ... + γ2qZ2q) + ε.    (3)
The subscript 2 indicates that the X's and Z's can be observed at any time up to the assignment of the second treatment. These variables might include observations of response during the initial treatment, presence of side effects, and patient characteristics (such as the number of past episodes), and would likely include the initial treatment to which the subject was assigned. The treatment T2 is coded as 1 if a switch in treatment is assigned and as 0 otherwise. As before, the β's and γ's can be estimated from SMART data using regression analysis. The decision rule is constructed in a
similar fashion to the construction given at the beginning of this subsection. That
is, given a patient with tailoring variables (Z21, ..., Z2q), switch treatment if the sum γ20 + γ21Z21 + ... + γ2qZ2q < 0.
Now consider the initial decision. As discussed above, it is insufficient to use the proximal response Y1 to the first decision in a model such as (1); Y1 represents only short-term benefits rather than both short- and long-term benefits. Instead, a term is added to Y1; this term represents the longer-term benefits of the initial decision. Denote this additional term by V; model (3) provides V. In fact,
if a subject's depression did not remit by 2 months, then

V = β20 + β21X21 + ... + β2pX2p + (γ20 + γ21Z21 + ... + γ2qZ2q) if γ20 + γ21Z21 + ... + γ2qZ2q < 0

(treatment T2 = 1 is best), and V = β20 + β21X21 + ... + β2pX2p otherwise (treatment T2 = 0 is best). Thus V represents the effect of the initial decision on both the depression score collected at the end of the 4-month period (Y2) and the best second treatment decision. For subjects whose depression remits by 2 months, set V to the predicted value of Y2 from a regression of Y2 on X21, ..., X2p among the remitting subjects. To construct the initial decision rule, use the data from all subjects and the model

Y1 + V = β10 + β11X11 + ... + β1pX1p + T1(γ10 + γ11Z11 + ... + γ1qZ1q) + ε.
The subscript 1 indicates that the X's and Z's can be observed at any time up to the assignment of the initial treatment; that is, the potential tailoring variables (the Z's) are pretreatment variables. The treatment T1 is
coded as 1 if medication A is assigned and as 0 otherwise. Note the addition of V to Y1 (Y1 is the depression score collected at 2 months) accounts for the effect of the initial treatment T1 on both Y2 and the choice of the best second treatment. As before, the β's and γ's can be estimated from SMART data using regression analysis. The constructed initial decision rule is: given a patient with tailoring variables (Z11, ..., Z1q), provide medication A if γ10 + γ11Z11 + ... + γ1qZ1q < 0.
Needed Research and Collaboration: Most adaptive treatment strategies
are multi-component treatments; the number of potential components may be
large (e.g., medications, psychosocial therapies, adherence enhancement efforts,
early versus late timing in changing treatment). Moreover, delivery mechanisms
can vary (e.g., group therapy versus individual therapy, telephone vs. in-person
counseling, etc.). A challenge is to generalize SMART designs so that scientists
can sift through many potential components, eliminating those that are less
effective. The generalized SMART design should minimize expense and logistical
difficulties by implementing a minimal number of experimental conditions
(groups of subjects each assigned a different adaptive treatment strategy).
To generalize SMART designs, one might consider experimental designs
from agriculture or engineering that test a plethora of components when the
number of experimental conditions in the trial cannot be large (Box et al., 1978;
Collins et al., 2005). These designs (called balanced fractional factorial designs)
minimize the number of experimental conditions in a trial yet they preserve
power and permit testing of a variety of components. The number of
experimental conditions is minimized by careful use of clinical experience, results
from past studies and theory. To test the efficacy of any one component, one
averages over responses from multiple experimental conditions. In these
designs, the relevant sample size is not the number of subjects per experimental
condition but, for example, in the case of main effects, is one half the total
sample size. Balanced fractional factorial designs are already in use in the
behavioral sciences (see
http://healthmedia.umich.edu/projects/project.php?id=35 and Collins et al., 2005). These promising designs have been developed primarily to construct multi-component treatments that are not adaptive to patient outcomes. Generalizing this approach for use in constructing adaptive treatment strategies requires the development of a balanced fractional factorial design that permits the evaluation of components assigned only when subjects qualify on the basis of their outcomes (e.g., when the disorder remits or relapses).
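As a concrete (and deliberately small) instance of the kind of design referenced above, the sketch below builds a 2^(4-1) balanced fractional factorial: four two-level components studied in eight experimental conditions instead of sixteen, with the fourth component aliased to the three-way interaction of the first three. The component labels are generic placeholders, not components of any particular trial.

```python
from itertools import product

# 2^(4-1) fractional factorial: components A, B, C run over a full two-level
# factorial, and component D is aliased with the ABC interaction (D = A*B*C).
# This yields 8 experimental conditions instead of the 16 of a full 2^4 design.
design = [(a, b, c, a * b * c) for a, b, c in product((-1, 1), repeat=3)]

# Balance: each component sits at each level in exactly half the conditions,
# so a main effect is estimated by averaging over half the total sample.
balanced = all(sum(row[k] for row in design) == 0 for k in range(4))

# Orthogonality: every pair of component columns is uncorrelated, so main
# effects can be estimated without interference from one another.
orthogonal = all(sum(row[i] * row[j] for row in design) == 0
                 for i in range(4) for j in range(i + 1, 4))
```

The price of the reduction is aliasing: here the main effect of D cannot be distinguished from the ABC interaction, which is why clinical experience, past results, and theory are needed to decide which interactions can safely be assumed negligible.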
Tailoring Variables
Because of the large number of potential tailoring variables, summaries
are needed. Even though theory or clinical experience might suggest natural summaries of potential tailoring variables, data-driven summaries might also be useful. Once formulated, data-derived, theory-based, and clinically based
summaries can be used to construct decision rules (e.g., using Q-Learning or a
similar methodology).
Consider the case of choosing one of two treatments for each patient
based on pretreatment observations (potential tailoring variables). The
construction of a number of summaries of the tailoring variables is called feature
construction. Ideally the number of summaries is much smaller than the total
number of variables (preferably, a single summary is sufficient). For ease of interpretation, the summaries might be weighted combinations of the original
variables. In forming the summaries, variables that are not helpful in decision
making should be eliminated. Ideally, many of the weights will be zero, so the
summary contains only the most important variables (variable selection).
Variable selection is very important considering the expense and time needed to
collect tailoring variables in both experimental and clinical settings.
Principal component analysis (PCA) (Dunteman, 1989; Jolliffe, 2002),
widely used to construct summaries, seeks linear combinations of the original
tailoring variables such that the derived components account for maximal
variance in the data. The first principal component is the summary capturing the most variance; the second principal component captures most of the remaining variance, and so on. Each principal component is a linear combination
of all of the original (pretreatment) variables (non-zero weights), so there is no
variable selection.
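A short sketch on invented data illustrates this property. PCA is computed here directly from the eigendecomposition of the sample covariance matrix; the data (five pretreatment variables driven by one shared latent factor) are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Invented data: 200 subjects, 5 pretreatment variables sharing a latent factor.
latent = rng.normal(size=(200, 1))
X = latent + 0.3 * rng.normal(size=(200, 5))

# PCA via the eigendecomposition of the sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

w1 = eigvecs[:, 0]   # weights of the first principal component
```

Every entry of w1 is non-zero, so each original variable would still have to be collected to compute the summary; this is the motivation for the sparse alternatives discussed next.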
Jolliffe et al. (2003) introduced SCoTLASS (see Appendix B) to obtain
modified principal components with zero weights. SCoTLASS works like PCA in
that it finds the weights that maximize the variance, but it forces the weights for
the less important tailoring variables to be set at zero (i.e., these variables are removed from the summary). This method is useful when one wishes to retain those
tailoring variables that are responsible for most of the variance, and to remove
those variables that contribute little to the variance.
A third method produces only summaries strongly related to the response (this is not necessarily the case with PCA and SCoTLASS). This method is
called Partial Least Squares (PLS) (see Haenlein and Kaplan, 2004 and Appendix
B). PLS is a popular technique that uses the response variable to construct linear
combinations of the tailoring variables. Specifically, PLS seeks weighted
combinations of the tailoring variables that account for the variance and have
high correlation with the response variable. Baer et al. (2005) provide an example of its use in the context of longitudinal studies.
Consider the case of choosing one of two treatments for each patient based on pretreatment observations. Suppose the data consist of (x, t, r) on each subject, where x is a collection of p potential tailoring (pretreatment) variables, t is an indicator variable denoting the assigned treatment, and r denotes the response. Consider the construction of a number of summaries of the tailoring variables, say (z1, ..., zq). For ease of interpretation, each summary zj, j = 1, ..., q, might be a weighted combination of the original variables:
zj = cj1x1 + ... + cjpxp,    (4)

where the cji, i = 1, ..., p, are weights. In forming the summaries, x's that are not helpful in decision making should be eliminated. SCoTLASS can be used to obtain
modified summaries with zero weights. Here is how SCoTLASS works when only the first summary (z1) is derived. SCoTLASS works like PCA in that it finds the weights (the c1j's) that maximize the variance

Var(∑_{j=1}^{p} c1jxj),

subject to the constraint ∑_{j=1}^{p} c1j² = 1, but it achieves sparsity in the weights by adding an additional constraint on the weights:

∑_{j=1}^{p} |c1j| ≤ u,

where u is specified by the user. For a sufficiently small value of u, some of the c1j's will be set exactly equal to zero; hence the corresponding tailoring variables are removed from the summary.
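Jolliffe et al.'s exact SCoTLASS algorithm requires specialized constrained optimization, so the sketch below uses a simpler stand-in in the same spirit: a soft-thresholded power method, a common sparse-PCA heuristic that combines the variance-maximizing power iteration with an L1-style shrinkage step. The data (two informative variables plus three low-variance noise variables) and the threshold value are invented for illustration; this is an approximation, not the constrained maximization stated above.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
s = rng.normal(size=n)                       # shared signal
X = np.column_stack([
    s + 0.1 * rng.normal(size=n),            # informative variable
    s + 0.1 * rng.normal(size=n),            # informative variable
    0.2 * rng.normal(size=n),                # noise variable
    0.2 * rng.normal(size=n),                # noise variable
    0.2 * rng.normal(size=n),                # noise variable
])
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (n - 1)

# Soft-thresholded power method: repeatedly step toward the leading
# eigenvector of the covariance matrix, shrink small weights exactly to
# zero, and renormalize to unit length.
lam = 0.1                                    # shrinkage threshold (invented)
c = np.ones(5) / np.sqrt(5.0)
for _ in range(200):
    v = cov @ c
    v = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)   # soft-threshold
    c = v / np.linalg.norm(v)
```

The weights for the three noise variables end up exactly zero, so those variables drop out of the summary, which is the practical payoff SCoTLASS aims for.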
As discussed in the text, PLS is a popular technique that uses the response variable to construct summaries. Specifically, PLS seeks weighted combinations that account for the variance in the x's and have high correlation with the response variable. In particular, the first PLS component finds the weights (the c1j's) that maximize the product of the squared correlation and the variance,

Corr²(r, ∑_{j=1}^{p} c1jxj) · Var(∑_{j=1}^{p} c1jxj),

subject to the constraint ∑_{j=1}^{p} c1j² = 1. Each PLS summary is a linear combination of all of the original (pretreatment) variables (non-zero weights), so there is no variable selection.
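The first PLS component has a convenient closed form: under the unit-norm constraint, the maximizing weight vector is proportional to the vector of sample covariances between each x and the response. A short sketch on invented data (four tailoring variables, of which only the first two drive the response):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
X = rng.normal(size=(n, 4))
# Invented response: driven mainly by the first two tailoring variables.
r = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 0.5 * rng.normal(size=n)

Xc = X - X.mean(axis=0)
rc = r - r.mean()

# First PLS weight vector: proportional to the covariance of each x with
# the response, scaled to unit length.
c1 = Xc.T @ rc
c1 /= np.linalg.norm(c1)

z1 = Xc @ c1                      # the first PLS summary
corr = np.corrcoef(z1, r)[0, 1]   # the summary tracks the response closely
```

Note that the uninformative variables still receive small non-zero weights: PLS concentrates weight on response-related variables but does not set any weight exactly to zero.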
Needed Research and Collaboration: Much of the current work in feature
construction and variable selection has focused on prediction rather than
decision making. In the identification and formation of potential tailoring
variables, ease and cost of collection (i.e., feasibility), reliability, and
interpretability (by both clinicians and patients) and discriminatory power must
be balanced. Those variables that are most time consuming and expensive to
collect should be very highly informative to decision making. Thus, an important
challenge for collaborative teams is to evaluate both the clinical and cost utility of
the above methods (PCA, SCoTLASS and PLS) for decision making rather than
simple prediction. These evaluations are particularly needed for sequential
clinical decision making where there appears to have been little practical
application of these approaches.
A second challenge is to develop methods that will produce summaries
targeted at decision making rather than response prediction. Generalizations of the above-mentioned techniques (e.g., PCA, SCoTLASS, and PLS) may be fruitful.
These summaries should interact with treatment, rather than simply predict
response. Variables that interact with treatment decisions are known as
moderators (Kramer et al., 2001, 2002) or effect modifiers (Rothman &
Greenland, 1998). Additionally the interaction should be sufficiently strong so
that for patients with certain values of the pre- or during- treatment variable one
of the treatment options or modifications is best while for patients with other
values of the pretreatment variable another option or modification is best. These
are called qualitative interactions (Peto, 1982; Byar, 1985). Another way to
express this is that a tailoring variable might be of little use, even if it is a
powerful predictor of response, if it does not predict differential response to
treatment.
A third challenge is to employ variable selection in creating summaries
when the goal is decision making. Present variable selection methods commonly used in engineering and statistics, such as subset selection, linear shrinkage methods (Breiman, 1995; Tibshirani, 1996), Bayesian model averaging (Hoeting, Madigan, Raftery & Volinsky, 1999), and WINNOW (Littlestone, 1988), have been developed only for the purpose of prediction. Lastly, these methods need to be
generalized for use in sequential decision making so as to predict differential
response to treatment.
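As a concrete instance of a shrinkage method, the sketch below implements the lasso (Tibshirani, 1996) by cyclic coordinate descent on invented data, showing how the L1 penalty sets the weights of uninformative variables exactly to zero, the behavior that would need to be retargeted from prediction to differential response to treatment.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 300, 8
X = rng.normal(size=(n, p))
# Invented response: only the first two of eight variables are informative.
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=n)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent:
    minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_ms = (X ** 2).sum(axis=0) / n          # per-column mean square
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]     # partial residual excluding j
            rho = X[:, j] @ r / n
            # Soft-threshold: small coordinates are set exactly to zero.
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ms[j]
    return b

b = lasso_cd(X, y, lam=0.2)
```

Here the six uninformative coefficients come out exactly zero while the two informative ones are retained (shrunken toward zero by the penalty); a decision-making analogue would instead penalize weights on variables with weak treatment interactions.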
Supplemental References
Box GEP, Hunter WG, Hunter JS (1978): Statistics for experimenters: An
introduction to design, data analysis, and model building. Wiley: New York.
Breiman L (1995). Better subset regression using the nonnegative garrote.
Technometrics, 37: 373-384.
Byar DP (1985). Assessing apparent treatment-covariate interactions in
randomized clinical trials. Stat Med 4: 255-263.
Collins LM, Murphy SA, Nair V, Strecher V (2005). A strategy for optimizing and evaluating behavioral interventions. Ann Behav Med 30:65-73.
Dunteman G (1989): Principal Components Analysis (Quantitative Applications in
the Social Sciences). SAGE:Newbury Park, CA.
Haenlein M, Kaplan AM (2004). A beginner's guide to partial least squares
analysis. Understanding Statistics 3(4):283-297.
Hoeting JA, Madigan D, Raftery AE, Volinsky CT (1999). Bayesian model averaging: a tutorial. Statistical Science 14(4):382-401.
Jolliffe IT (2002): Principal Component Analysis, Second Edition, Springer: New
York.
Jolliffe IT, Trendafilov NT, Uddin M (2003). A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics 12:531-547.
Littlestone N (1988). Learning quickly when irrelevant attributes abound: a new
linear-threshold algorithm. Machine Learning 2:285-318.
Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of
the Royal Statistical Society, Series B, Methodological 58:267-288.