Sequential, Multiple Assignment, Randomized Trials and Treatment Policies

S.A. Murphy
MUCMD, 08/10/12

Outline
• Treatment Policies
• Sequential Multiple Assignment Randomized Trials, "SMART Studies"
• Q-Learning / Fitted Q Iteration
• Where we are going…

Treatment Policies
Treatment policies are individually tailored treatments, with treatment type and dosage changing according to patient outcomes. They operationalize sequential decision making in clinical practice.
• k stages for each individual
• X_j: observation available at the jth stage
• A_j: action at the jth stage (usually a treatment)

Example of a Treatment Policy
• Adaptive Drug Court Program for drug-abusing offenders.
• Goal is to minimize recidivism and drug use.
• Marlowe et al. (2008, 2009, 2011)

Adaptive Drug Court Program
(Flow diagram; recoverable structure:)
• Low-risk offenders: as-needed court hearings + standard counseling. If non-responsive: as-needed court hearings + ICM. If non-compliant: bi-weekly court hearings + standard counseling.
• High-risk offenders: bi-weekly court hearings + standard counseling. If non-responsive: bi-weekly court hearings + ICM. If non-compliant: court-determined disposition.

Usually k = 2 Stages (Finite Horizon = 2)
Goal: Use a training set of n trajectories, one trajectory per subject, to construct a treatment policy that outputs the actions. The treatment policy should maximize total reward. The treatment policy is a sequence of two decision rules:
d_1(X_1), d_2(X_1, A_1, X_2)

Randomized Trials
What is a sequential, multiple assignment, randomized trial (SMART)? Each subject proceeds through multiple stages of treatment; randomization takes place at each stage. Exploration, no exploitation. Usually 2-3 treatment stages.

Pelham's ADHD Study
(Flow diagram; recoverable structure:)
• Arm A: begin low-intensity behavior modification; assess adequate response at 8 weeks.
  - Yes: A1. Continue, reassess monthly; randomize if the child deteriorates.
  - No: random assignment to A2 (augment with the other treatment) or A3 (increase intensity of present treatment).
• Arm B: begin low-dose medication; assess adequate response at 8 weeks.
  - Yes: B1. Continue, reassess monthly; randomize if the child deteriorates.
  - No: random assignment to B2 (increase intensity of present treatment) or B3 (augment with the other treatment).

Oslin's ExTENd Study
(Flow diagram; recoverable structure:)
• Initial random assignment: early trigger for nonresponse vs. late trigger for nonresponse. Both arms receive naltrexone for 8 weeks.
• Response: random assignment to naltrexone alone or TDM + naltrexone.
• Nonresponse: random assignment to CBI or CBI + naltrexone.

Jones' Study for Drug-Addicted Pregnant Women
(Flow diagram; recoverable structure:)
• Initial random assignment: tRBT vs. rRBT, with response assessed at 2 weeks.
• tRBT arm: responders randomly assigned to tRBT or rRBT; non-responders randomly assigned to tRBT or eRBT.
• rRBT arm: responders randomly assigned to rRBT or aRBT; non-responders randomly assigned to rRBT or tRBT.

Usually 2 Stages (Finite Horizon = 2)
Goal: Use a training set of n trajectories, one per subject, each of the form (X_1, A_1, X_2, A_2, X_3), to construct a treatment policy. The treatment policy should maximize total reward. A_j is a randomized action with known randomization probability; here actions are binary with P[A_j = 1] = P[A_j = -1] = .5.
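To make the data structure concrete, here is a minimal sketch, not from the talk, of how a training set of this form might be simulated; the generative model is invented purely for illustration, and only the coin-flip randomization P[A_j = 1] = P[A_j = -1] = .5 comes from the setup above.

```python
# Minimal sketch: simulate n two-stage SMART trajectories
# (X1, A1, X2, A2, Y). The outcome model is a made-up illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 138                              # one trajectory per subject

X1 = rng.normal(size=n)              # stage 1 observation
A1 = rng.choice([-1, 1], size=n)     # stage 1 action, randomized .5/.5
X2 = 0.5 * X1 + rng.normal(size=n)   # stage 2 observation
A2 = rng.choice([-1, 1], size=n)     # stage 2 action, randomized .5/.5
# Terminal reward with hypothetical treatment-by-covariate interactions:
Y = X1 + X2 + A1 * (0.2 + 0.3 * X1) + A2 * (0.1 + 0.4 * X2) \
    + rng.normal(size=n)
```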
Secondary Data Analysis: Q-Learning
• Q-Learning, Fitted Q Iteration, Approximate Dynamic Programming (Watkins, 1989; Ernst et al., 2005; Murphy, 2003; Robins, 2004).
• This results in a proposal for an optimal treatment policy.
• A subsequent randomized trial would evaluate the proposed treatment policy.

2 Stages with Terminal Reward Y
Goal: Use the training set to construct d_1(X_1), d_2(X_1, A_1, X_2) for which the average value, E_{d_1, d_2}[Y], is maximal. The maximal average value is
V^{opt} = \max_{d_1, d_2} E_{d_1, d_2}[Y]

Idea behind Q-Learning / Fitted Q
V^{opt} = \max_{d_1, d_2} E_{d_1, d_2}[Y]
        = E[ \max_{a_1} E[ \max_{a_2} E[ Y | X_1, A_1 = a_1, X_2, A_2 = a_2 ] | X_1, A_1 = a_1 ] ]
• Stage 2 Q-function: Q_2(X_1, A_1, X_2, A_2) = E[ Y | X_1, A_1, X_2, A_2 ], so
  V^{opt} = E[ \max_{a_1} E[ \max_{a_2} Q_2(X_1, A_1, X_2, a_2) | X_1, A_1 = a_1 ] ]
• Stage 1 Q-function: Q_1(X_1, A_1) = E[ \max_{a_2} Q_2(X_1, A_1, X_2, a_2) | X_1, A_1 ], so
  V^{opt} = E[ \max_{a_1} Q_1(X_1, a_1) ]

Simple Version of Fitted Q-Iteration
Use regression at each stage to approximate the Q-function.
• Stage 2 regression: regress Y on (S_{20}, a_2 S_{21}) to obtain
  \hat{Q}_2 = \hat{\alpha}_2^T S_{20} + \hat{\beta}_2^T S_{21} a_2
• The arg-max over a_2 yields \hat{d}_2(X_1, A_1, X_2); with binary actions coded ±1 this is the sign of \hat{\beta}_2^T S_{21}.

Value for Subjects Entering Stage 2
\tilde{Y} = \max_{a_2} \hat{Q}_2(X_1, A_1, X_2, a_2)
• \tilde{Y} is a predictor of \max_{a_2} Q_2(X_1, A_1, X_2, a_2).
• \tilde{Y} is the dependent variable in the stage 1 regression for patients moving to stage 2.

Simple Version of Fitted Q-Iteration (continued)
• Stage 1 regression: regress \tilde{Y} on (S_{10}, a_1 S_{11}) to obtain
  \hat{Q}_1 = \hat{\alpha}_1^T S_{10} + \hat{\beta}_1^T S_{11} a_1
• The arg-max over a_1 yields \hat{d}_1(X_1); with binary actions coded ±1 this is the sign of \hat{\beta}_1^T S_{11}.

Decision Rules
\hat{d}_2(X_1, A_1, X_2) = \arg\max_{a_2} \hat{Q}_2(X_1, A_1, X_2, a_2) = sign(\hat{\beta}_2^T S_{21})
\hat{d}_1(X_1) = \arg\max_{a_1} \hat{Q}_1(X_1, a_1) = sign(\hat{\beta}_1^T S_{11})
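The two regression steps can be written out in a few lines. Below is a minimal sketch, not the authors' code: ordinary least squares stands in for whatever regression method one prefers, and the feature vectors S_{20}, S_{21}, S_{10}, S_{11} (an intercept plus a covariate or two) and the toy data set are illustrative assumptions.

```python
# Minimal sketch of the simple fitted Q-iteration above, with binary
# actions coded +/-1 and least-squares regression at each stage.
import numpy as np

def fit_stage(main_feats, int_feats, action, outcome):
    """Regress outcome on (main_feats, action * int_feats);
    return (alpha_hat, beta_hat)."""
    design = np.column_stack([main_feats, action[:, None] * int_feats])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    k = main_feats.shape[1]
    return coef[:k], coef[k:]

# Toy training set standing in for n SMART trajectories.
rng = np.random.default_rng(1)
n = 200
X1 = rng.normal(size=n); A1 = rng.choice([-1, 1], size=n)
X2 = 0.5 * X1 + rng.normal(size=n); A2 = rng.choice([-1, 1], size=n)
Y = X1 + X2 + 0.2 * A1 + A2 * (0.3 + 0.5 * X2) + rng.normal(size=n)

# Stage 2 regression: S20 = (1, X1, A1, X2), S21 = (1, X2).
S20 = np.column_stack([np.ones(n), X1, A1, X2])
S21 = np.column_stack([np.ones(n), X2])
alpha2, beta2 = fit_stage(S20, S21, A2, Y)
d2_hat = np.sign(S21 @ beta2)            # arg-max over a2 in {-1, +1}

# Value for subjects entering stage 2:
# Y_tilde = max over a2 of Q2_hat = alpha2'S20 + |beta2'S21|.
Y_tilde = S20 @ alpha2 + np.abs(S21 @ beta2)

# Stage 1 regression: S10 = (1, X1), S11 = (1, X1).
S10 = np.column_stack([np.ones(n), X1])
S11 = np.column_stack([np.ones(n), X1])
alpha1, beta1 = fit_stage(S10, S11, A1, Y_tilde)
d1_hat = np.sign(S11 @ beta1)            # arg-max over a1 in {-1, +1}
```

Because the actions are coded ±1, maximizing over an action amounts to adding the absolute value of the interaction term, and the estimated rule is that term's sign; this is the same arithmetic behind the ADHD decision rules that follow.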
Pelham's ADHD Study
(The SMART design is the one diagrammed earlier: low-intensity behavior modification vs. low-dose medication at stage 1, with non-responders re-randomized to intensify vs. augment.)

ADHD Data
138 trajectories of the form (X_1, A_1, R_1, X_2, A_2, Y).
• Y = end-of-year school performance.
• R_1 = 1 if responder; = 0 if non-responder.
• X_2 includes the month of non-response, M_2, and a measure of adherence in stage 1 (S_2): S_2 = 1 if adherent in stage 1; = 0 if non-adherent.
• X_1 includes baseline school performance, Y_0; whether medicated in the prior year (S_1): S_1 = 1 if medicated in the prior year; = 0 otherwise; and ODD diagnosis (O_1).

Q-Learning Using Data on Children with ADHD
• Stage 2 regression for Y:
  (1, Y_0, S_1, O_1, A_1, M_2, S_2)\alpha_2 + A_2(\beta_{21} + A_1\beta_{22} + S_2\beta_{23})
• Decision rule: "If the child is non-responding, then intensify the initial treatment if -0.72 + 0.05 A_1 + 0.97 S_2 > 0; otherwise augment."

Decision Rule for Non-responding Children

                 Initial Treatment = BMOD   Initial Treatment = MED
Adherent         Intensify                  Intensify
Not adherent     Augment                    Augment

ADHD Example
• Stage 1 regression for \tilde{Y}:
  (1, Y_0, S_1, O_1)\alpha_1 + A_1(\beta_{11} + S_1\beta_{12})
• Decision rule: "Begin with BMOD if 0.17 - 0.32 S_1 > 0; otherwise begin with MED."

Initial Decision Rule

                  Initial Treatment
Prior MEDS        MED
No prior MEDS     BMOD

ADHD Example
• The treatment policy is quite decisive, yet we developed it using a trial with only 138 children. Is there sufficient evidence in the data to warrant this level of decisiveness?
• Would a similar trial obtain similar results?
• There are strong opinions regarding how to treat ADHD.
• One solution: use confidence intervals.

ADHD Example: Treatment Decision for Non-responders
A positive treatment effect favors Intensify.

                         90% Confidence Interval
Adherent to BMOD         (-0.08, 0.69)
Adherent to MED          (-0.18, 0.62)
Non-adherent to BMOD     (-1.10, -0.28)
Non-adherent to MED      (-1.25, -0.29)

ADHD Example: Initial Treatment Decision
A positive treatment effect favors BMOD.

                  90% Confidence Interval
Prior MEDS        (-0.48, 0.16)
No prior MEDS     (-0.05, 0.39)

Proposal for Treatment Policy
• IF medication was not used in the prior year, THEN begin with BMOD; ELSE select either BMOD or MED.
• IF the child is non-responsive and was non-adherent, THEN augment the present treatment; ELSE IF the child is non-responsive and was adherent, THEN select either intensification or augmentation of the current treatment.

Where Are We Going?
• Increasing use of wearable computers (e.g., smartphones) to both collect real-time data and provide real-time treatment.
• We are working on the design of studies involving randomization (soft-max or epsilon-greedy choice of actions) to develop and continually improve treatment policies; a toy sketch of these action-selection rules appears at the end of this document.
• We need confidence measures for infinite-horizon problems.

This seminar can be found at:
http://www.stat.lsa.umich.edu/~samurphy/seminars/MUCMD.08.10.12.pdf

This seminar is based on work with many collaborators, some of whom are: L. Collins, E. Laber, M. Qian, D. Almirall, K. Lynch, J. McKay, D. Oslin, T. Ten Have, I. Nahum-Shani & B. Pelham.

Email with questions or if you would like a copy: samurphy@umich.edu
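As referenced in the "Where Are We Going?" slide, here is a toy sketch, an assumption rather than material from the talk, of epsilon-greedy and soft-max action selection: both mostly choose the currently estimated best action while keeping some probability of exploration, which is what lets a deployed policy keep generating data for continued improvement.

```python
# Toy sketch of the two exploration rules named in the closing slide;
# function names and default parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon choose uniformly at random,
    otherwise choose the action with the largest estimated Q-value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def soft_max(q_values, temperature=1.0):
    """Choose action a with probability proportional to exp(Q(a) / temperature)."""
    z = np.exp((np.asarray(q_values) - np.max(q_values)) / temperature)
    return int(rng.choice(len(q_values), p=z / z.sum()))

# Example: estimated Q-values for two actions coded as indices 0 and 1.
print(epsilon_greedy([0.2, 0.5]), soft_max([0.2, 0.5]))
```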