Sequential, Multiple Assignment, Randomized Trials and Treatment Policies

S.A. Murphy, UAlberta, 09/28/12

Outline
• Treatment Policies
• Data Sources
• Q-Learning
• Confidence Intervals

Treatment Policies

Treatment policies are individually tailored treatments, with treatment type and dosage changing according to patient outcomes. They operationalize sequential decision making in clinical practice.
• k stages for each individual
• X_j: observations available at the jth stage
• A_j: action at the jth stage (usually a treatment)

Example of a Treatment Policy
• Adaptive Drug Court Program for drug-abusing offenders.
• Goal is to minimize recidivism and drug use.
• Marlowe et al. (2008, 2009, 2011)

Adaptive Drug Court Program

[Flowchart: low-risk offenders begin with as-needed court hearings + standard counseling; high-risk offenders begin with bi-weekly court hearings + standard counseling. Non-responsive offenders have ICM added to their court hearings; non-compliant offenders receive a court-determined disposition.]

k = 2 Stages

The treatment policy is a sequence of two decision rules, d_1 and d_2.

Goal: use a data set of n trajectories, each of the form (X_1, A_1, X_2, A_2, Y) (a trajectory per subject), to construct a treatment policy. The treatment policy should produce maximal reward, Y.

Why should a Machine Learning Researcher be interested in Treatment Policies?
• The dimensionality of the data available for constructing decision rules accumulates at an exponential rate with the stage.
• Need both feature construction and feature selection.

Outline
• Treatment Policies
• Data Sources
• Q-Learning
• Confidence Intervals

Experimental Data

Data from sequential, multiple assignment, randomized trials: n subjects, each yielding a trajectory. For 2 stages, the trajectory for each subject is of the form (X_1, A_1, X_2, A_2, Y). (Exploration, no exploitation.)

A_j is a randomized treatment action with known randomization probability; here, binary actions with P[A_j = 1] = P[A_j = -1] = .5.

Pelham's ADHD Study

[SMART design: children are randomized to (A) begin low-intensity behavior modification or (B) begin low-dose medication. After 8 weeks, adequate response is assessed. Responders continue the initial treatment and are reassessed monthly (re-randomized if they deteriorate). Non-responders are re-randomized to either augment with the other treatment or increase the intensity of the present treatment.]

Oslin's ExTENd Study

[SMART design: patients begin naltrexone and are randomized to an early or a late trigger for non-response. After 8 weeks, responders are re-randomized to naltrexone alone or TDM + naltrexone; non-responders are re-randomized to CBI or CBI + naltrexone.]

Jones' Study for Drug-Addicted Pregnant Women

[SMART design: women are initially randomized between two variants of reinforcement-based treatment (rRBT or aRBT). After 2 weeks, responders and non-responders in each arm are re-randomized among further RBT variants (tRBT, rRBT, eRBT).]

Kasari Autism Study

[SMART design: children are randomized to (A) JAE + EMT or (B) JAE + AAC. After 12 weeks, adequate response is assessed. In arm A, responders continue JAE + EMT and non-responders are re-randomized to intensified JAE + EMT or JAE + AAC. In arm B, responders continue JAE + AAC and non-responders receive intensified JAE + AAC.]
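The four studies above share the same data structure. The following minimal sketch (not from the talk; the generative model and variable names are illustrative assumptions) simulates n two-stage SMART trajectories with binary actions randomized with probability .5:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_smart(n=100):
    """Simulate n two-stage SMART trajectories (X1, A1, X2, A2, Y).

    Actions are binary (+1/-1), randomized with probability .5
    ("exploration, no exploitation").  The outcome model is arbitrary,
    chosen only for illustration.
    """
    X1 = rng.normal(size=n)                           # baseline observations
    A1 = rng.choice([-1, 1], size=n)                  # stage 1 randomized action
    X2 = 0.5 * X1 + 0.3 * A1 + rng.normal(size=n)     # intermediate observations
    A2 = rng.choice([-1, 1], size=n)                  # stage 2 randomized action
    Y = X2 + A2 * (0.2 + 0.4 * X2) + rng.normal(size=n)  # terminal reward
    return X1, A1, X2, A2, Y
```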
Newer Experimental Designs
• Using smart phones to collect data, X_i's, in real time and to provide treatments, A_i's, in real time to n subjects. The treatments, A_i's, are randomized among a feasible set of treatment options.
  – The number of treatment stages is very large, so we want a Markovian property.
  – Feature construction of states in the Markov process.

Observational Data
• Longitudinal studies
• Patient registries
• Electronic medical record data

Outline
• Treatment Policies
• Data Sources
• Q-Learning / Fitted Q-Iteration
• Confidence Intervals

Secondary Data Analysis: Q-Learning
• Q-Learning, Fitted Q-Iteration, Approximate Dynamic Programming (Watkins, 1989; Ernst et al., 2005; Murphy, 2003; Robins, 2004).
• This results in a proposal for an optimal treatment policy.
• A subsequent randomized trial would evaluate the proposed treatment policy.

2 Stages — Terminal Reward Y

Goal: use data to construct decision rules d_1(X_1) and d_2(X_1, A_1, X_2) for which the average value, E_{d_1, d_2}[Y], is maximal. The maximal average value is

  V^opt = max_{d_1, d_2} E_{d_1, d_2}[Y]

Idea behind Q-Learning / Fitted Q

  V^opt = E[ max_{a_1} E[ max_{a_2} E[Y | X_1, A_1 = a_1, X_2, A_2 = a_2] | X_1, A_1 = a_1 ] ]

• Stage 2 Q-function: Q_2(X_1, A_1, X_2, A_2) = E[Y | X_1, A_1, X_2, A_2], so

  V^opt = E[ max_{a_1} E[ max_{a_2} Q_2(X_1, a_1, X_2, a_2) | X_1, A_1 = a_1 ] ]

• Stage 1 Q-function: Q_1(X_1, A_1) = E[ max_{a_2} Q_2(X_1, A_1, X_2, a_2) | X_1, A_1 ], so

  V^opt = E[ max_{a_1} Q_1(X_1, a_1) ]

Optimal Treatment Policy

The optimal treatment policy is (d_1, d_2), where

  d_2(X_1, A_1, X_2) = arg max_{a_2} Q_2(X_1, A_1, X_2, a_2)
  d_1(X_1) = arg max_{a_1} Q_1(X_1, a_1)

Simple Version of Fitted Q-Iteration

Use regression at each stage to approximate the Q-function.
• Stage 2 regression: regress Y on (S_2, S_2 A_2), where S_2 is a feature vector built from (X_1, A_1, X_2), to obtain

  Q̂_2 = α̂_2ᵀ S_2 + β̂_2ᵀ S_2 a_2

• The arg-max over a_2 yields d̂_2(X_1, A_1, X_2) = sign(β̂_2ᵀ S_2).

Value for Subjects Entering Stage 2

  Ỹ = α̂_2ᵀ S_2 + max_{a_2} β̂_2ᵀ S_2 a_2 = α̂_2ᵀ S_2 + |β̂_2ᵀ S_2|

• Ỹ is a predictor of max_{a_2} Q_2(X_1, A_1, X_2, a_2).
• Ỹ is the dependent variable in the stage 1 regression for patients who moved to stage 2.

Simple Version of Fitted Q-Iteration, continued
• Stage 1 regression: regress Ỹ on (S_1, S_1 A_1), where S_1 is a feature vector built from X_1, to obtain

  Q̂_1 = α̂_1ᵀ S_1 + β̂_1ᵀ S_1 a_1

• The arg-max over a_1 yields d̂_1(X_1) = sign(β̂_1ᵀ S_1).

Decision Rules

  d̂_2(X_1, A_1, X_2) = arg max_{a_2} Q̂_2(X_1, A_1, X_2, a_2) = sign(β̂_2ᵀ S_2)
  d̂_1(X_1) = arg max_{a_1} Q̂_1(X_1, a_1) = sign(β̂_1ᵀ S_1)
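A minimal sketch of this simple fitted Q-iteration for binary actions coded ±1, using ordinary least squares; it can be run on the output of the simulator sketched earlier. The feature choices for S_1 and S_2 are illustrative assumptions, not the features used in the talk.

```python
import numpy as np

def fitted_q(X1, A1, X2, A2, Y):
    """Two-stage Q-learning with linear working models, actions in {-1, +1}."""
    n = len(Y)
    S2 = np.column_stack([np.ones(n), X1, A1, X2])  # stage 2 features (assumed)
    S1 = np.column_stack([np.ones(n), X1])          # stage 1 features (assumed)

    # Stage 2: regress Y on (S2, S2*A2) -> Qhat2 = alpha2'S2 + beta2'S2 a2.
    D2 = np.column_stack([S2, S2 * A2[:, None]])
    coef2 = np.linalg.lstsq(D2, Y, rcond=None)[0]
    alpha2, beta2 = coef2[:S2.shape[1]], coef2[S2.shape[1]:]

    # Pseudo-outcome: Ytilde = alpha2'S2 + |beta2'S2| (the max over a2 = +/-1).
    Ytilde = S2 @ alpha2 + np.abs(S2 @ beta2)

    # Stage 1: regress Ytilde on (S1, S1*A1) -> Qhat1 = alpha1'S1 + beta1'S1 a1.
    D1 = np.column_stack([S1, S1 * A1[:, None]])
    coef1 = np.linalg.lstsq(D1, Ytilde, rcond=None)[0]
    alpha1, beta1 = coef1[:S1.shape[1]], coef1[S1.shape[1]:]

    # Estimated decision rules are dhat_j(s) = sign(betaj's).
    return (alpha1, beta1), (alpha2, beta2)

# Example usage, paired with the simulator sketched earlier:
# (alpha1, beta1), (alpha2, beta2) = fitted_q(*simulate_smart(500))
```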
Pelham's ADHD Study

(SMART design as shown earlier: initial randomization to low-intensity behavior modification, BMOD, or low-dose medication, MED; after 8 weeks, non-responders are re-randomized to intensify or augment.)

ADHD Data

138 trajectories of the form (X_1, A_1, R_1, X_2, A_2, Y):
• X_1 includes baseline school performance (Y_0), whether medicated in the prior year (S_1), and ODD diagnosis (O_1); S_1 = 1 if medicated in the prior year, 0 otherwise.
• R_1 = 1 if responder, 0 if non-responder.
• X_2 includes the month of non-response (M_2) and a measure of adherence in stage 1 (S_2); S_2 = 1 if adherent in stage 1, 0 if non-adherent.
• Y = end-of-year school performance.

Q-Learning Using Data on Children with ADHD
• Stage 2 regression model for Y:

  (1, Y_0, S_1, O_1, A_1, M_2, S_2) α_2 + A_2 (β_21 + A_1 β_22 + S_2 β_23)

• Estimated decision rule: "if the child is non-responding, then intensify the initial treatment if -.72 + .05 A_1 + .97 S_2 > 0; otherwise augment."

Decision Rule for Non-responding Children

                  Initial Treatment = BMOD   Initial Treatment = MED
  Adherent        Intensify                  Intensify
  Not adherent    Augment                    Augment

ADHD Example
• Stage 1 regression model for Ỹ:

  (1, Y_0, S_1, O_1) α_1 + A_1 (β_11 + S_1 β_12)

• Estimated decision rule: "begin with BMOD if .17 - .32 S_1 > 0; otherwise begin with MED."

Initial Decision Rule

                   Initial Treatment
  Prior MEDS       MED
  No prior MEDS    BMOD

ADHD Example
• The treatment policy is quite decisive. We developed this treatment policy using a trial on only 138 children. Is there sufficient evidence in the data to warrant this level of decisiveness?
• Would a similar trial obtain similar results?
• There are strong opinions regarding how to treat ADHD.
• One solution: use confidence intervals.

Outline
• Treatment Policies
• Data Sources
• Q-Learning
• Confidence Intervals

ADHD Example: Treatment Decision for Non-responders

A positive treatment effect favors intensifying.

                          90% Confidence Interval
  Adherent to BMOD        (-0.08, 0.69)
  Adherent to MED         (-0.18, 0.62)
  Non-adherent to BMOD    (-1.10, -0.28)
  Non-adherent to MED     (-1.25, -0.29)

ADHD Example: Initial Treatment Decision

A positive treatment effect favors BMOD.

                   90% Confidence Interval
  Prior MEDS       (-0.48, 0.16)
  No prior MEDS    (-0.05, 0.39)

Proposal for Treatment Policy

IF medication was not used in the prior year, THEN begin with BMOD; ELSE select either BMOD or MED.

IF the child is non-responsive, THEN: IF the child was non-adherent, THEN augment the present treatment; ELSE IF the child was adherent, THEN select either intensification or augmentation of the current treatment.

Confidence Intervals

We construct confidence intervals for the treatment effects at stage 2 and at stage 1. The stage 2 analysis is classical regression (at least if the stage 2 feature vectors are low dimensional), so constructing confidence intervals there is standard. Constructing confidence intervals for the treatment effects at stage 1 is challenging.

Confidence Intervals for Stage 1 Treatment Effects

Challenge: the stage 2 estimated value, V̂_2 (the quantity Ỹ above), is non-smooth in the estimators from the stage 2 regression, due to the non-differentiability of the maximization:

  V̂_2 = α̂_2ᵀ S_2 + max_{a_2} β̂_2ᵀ S_2 a_2 = α̂_2ᵀ S_2 + |β̂_2ᵀ S_2|

Non-regularity
• The estimated policy can change abruptly from training set to training set, and standard approximations used to construct confidence intervals perform poorly (Shao, 1994; Andrews, 2000); see the simulation sketch below.
• The problematic area of the parameter space is around values of β_2 for which P[β_2ᵀ S_2 ≈ 0] > 0.
• We combined a local generalization-type error bound with a standard statistical confidence interval to produce a valid confidence interval.

Why is this non-smoothness, and the resulting inferential problems, relevant to high-dimensional machine learning research?
• Sparsity assumptions in high-dimensional data analysis
• Thresholding
• Non-smoothness at important parameter values
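To see the source of the difficulty concretely, here is a small simulation, a sketch under simplifying assumptions (a scalar stage 2 effect, so |β̂_2ᵀ S_2| reduces to |β̂_2|). It shows that the sampling distribution of the maximized term is approximately normal when β_2 is far from zero but collapses to a one-sided, half-normal limit at β_2 = 0, which is why standard normal-theory and bootstrap intervals behave poorly there. This only illustrates the phenomenon; it is not the error-bound construction mentioned above.

```python
import numpy as np

rng = np.random.default_rng(1)

def abs_effect_dist(beta2, n=200, n_sims=2000):
    """Sampling distribution of sqrt(n) * (|beta2_hat| - |beta2|), where
    beta2_hat is the mean of n draws from N(beta2, 1).  Far from zero this
    is approximately N(0, 1); at beta2 = 0 it is half-normal (never
    negative), so symmetric intervals centered at |beta2_hat| misbehave."""
    beta2_hat = rng.normal(beta2, 1.0, size=(n_sims, n)).mean(axis=1)
    return np.sqrt(n) * (np.abs(beta2_hat) - abs(beta2))

for b in (1.0, 0.0):
    d = abs_effect_dist(b)
    print(f"beta2 = {b}: mean = {d.mean():+.2f}, "
          f"fraction below zero = {(d < 0).mean():.2f}")
# Expected pattern: mean near 0 with half the mass below zero for beta2 = 1;
# mean near +0.8 with no mass below zero for beta2 = 0 (non-normal limit).
```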
Where Are We Going?
• Increasing use of wearable computers (e.g., smart phones) both to collect real-time data and to provide real-time treatment.
• We are working on clinical trial designs involving randomization (soft-max or epsilon-greedy choice of actions) so as to develop and continually improve treatment policies.
• Need confidence measures for infinite-horizon problems.

This seminar can be found at: http://www.stat.lsa.umich.edu/~samurphy/seminars/UAlberta.09.28.12.pdf

This seminar is based on work with many collaborators, some of whom are: L. Collins, E. Laber, M. Qian, D. Almirall, K. Lynch, J. McKay, D. Oslin, T. Ten Have, I. Nahum-Shani & B. Pelham.

Email with questions or if you would like a copy: samurphy@umich.edu