Bristol04.10.12

advertisement
Confidence Intervals,
Q-Learning and
Dynamic Treatment Regimes
S.A. Murphy
Time for Causality – Bristol
April, 2012
1
Outline
• Dynamic Treatment Regimes
• Example Experimental Designs & Challenges
• Q-Learning & Challenges
2
Dynamic treatment regimes are individually tailored
treatments, with treatment type and dosage changing
according to patient outcomes. Operationalize clinical
practice.
k Stages for one individual
Observation available at jth stage
Action at jth stage (usually a treatment)
3
Example of a Dynamic Treatment
Regime
•Adaptive Drug Court Program for drug
abusing offenders.
•Goal is to minimize recidivism and drug
use.
•Marlowe et al. (2008, 2011)
4
Adaptive Drug Court Program
non-responsive
low risk
As-needed court hearings
+ standard counseling
As-needed court hearings
+ ICM
non-compliant
high risk
non-responsive
Bi-weekly court hearings
+ standard counseling
Bi-weekly court hearings
+ ICM
non-compliant
Court-determined
disposition
5
k=2 Stages
Goal: Construct decision rules that input information
available at each stage and output a recommended
decision; these decision rules should lead to a maximal
mean Y where Y is a function of
The dynamic treatment regime is a sequence of two
decision rules:
6
Outline
• Dynamic Treatment Regimes
• Example Experimental Designs & Challenges
• Q-Learning & Challenges
7
Data for Constructing the Dynamic Treatment Regime:
Subject data from sequential, multiple assignment,
randomized trials. At each stage subjects are
randomized among alternative options.
Aj is a randomized action with known randomization
probability.
binary actions with P[Aj=1]=P[Aj=-1]=.5
8
Pelham’s ADHD Study
A1. Continue, reassess monthly;
randomize if deteriorate
Yes
8 weeks
A. Begin low-intensity
behavior modification
A2. Add medication;
bemod remains stable but
medication dose may vary
AssessAdequate response?
No
Random
assignment:
Random
assignment:
A3. Increase intensity of bemod
with adaptive modifications based on impairment
B1. Continue, reassess monthly;
randomize if deteriorate
8 weeks
B. Begin low dose
medication
AssessAdequate response?
No
Random
assignment:
B2. Increase dose of medication
with monthly changes
as needed
B3. Add behavioral
treatment; medication dose
remains stable but intensity
of bemod may increase
with adaptive modifications
9
based on impairment
Oslin’s ExTENd Study
Naltrexone
8 wks Response
Random
assignment:
Early Trigger for
Nonresponse
Random
assignment:
TDM + Naltrexone
CBI
Nonresponse
CBI +Naltrexone
Random
assignment:
Naltrexone
8 wks Response
Random
assignment:
TDM + Naltrexone
Late Trigger for
Nonresponse
Random
assignment:
Nonresponse
CBI
CBI +Naltrexone
10
Kasari Autism Study
JAE+EMT
Yes
12 weeks
A. JAE+ EMT
AssessAdequate response?
JAE+EMT+++
Random
assignment:
No
JAE+AAC
Random
assignment:
Yes
12 weeks
B. JAE + AAC
B!. JAE+AAC
AssessAdequate response?
No
B2. JAE +AAC ++
11
Jones’ Study for Drug-Addicted
Pregnant Women
rRBT
2 wks Response
Random
assignment:
tRBT
Random
assignment:
tRBT
tRBT
Nonresponse
eRBT
Random
assignment:
2 wks Response
aRBT
Random
assignment:
rRBT
rRBT
Random
assignment:
Nonresponse
tRBT
rRBT
Challenges
• Goal of trial may differ
– Experimental designs for settings involving new
drugs (cancer, ulcerative colitis)
– Experimental designs for settings involving already
approved drugs/treatments
• Choice of Primary Hypothesis/Analysis &
Secondary Hypothesis/Analysis
• Longitudinal/Survival Primary Outcomes
• Sample Size Formulae
13
Outline
• Dynamic Treatment Regimes
• Example Experimental Designs & Challenges
• Q-Learning & Challenges
14
Secondary Analysis: Q-Learning
•Q-Learning (Watkins, 1989; Ernst et al., 2005;
Murphy, 2005) (a popular method from
computer science)
•Optimal nested structural mean model
(Murphy, 2003; Robins, 2004)
• The first method is an inefficient version of the
second method when (a) linear models are used, (b)
each stages’ covariates include the prior stages’
covariates and (c) the treatment variables are coded
to have conditional mean zero.
15
k=2 Stages
Goal: Construct d1 (X 1); d2 (X 1; A 1; X 2 )
for which E d1 ;d2 [Y ] is maximal.
Vd1 ;d2 = E d1 ;d2 [Y] is called the value
and the maximal value is denoted by
V
opt
= max E d1 ;d2 [Y ]
d1 ;d2
16
Idea behind Q-Learning
¯
¸¸
¯
= E max E max E [Y jX 1 ; A 1 ; X 2 ; A 2 = a2 ] ¯¯X 1 ; A 1 = a1
·
V opt
·
a1
a2
² Stage 2 Q-function Q2 (X 1 ; A 1 ; X 2 ; A 2 ) = E [Y jX 1 ; A 1 ; X 2 ; A 2 ]
¯
·
·
¸¸
¯
V opt = E max E max Q2 (X 1 ; A 1 ; X 2 ; a2 ) ¯¯X 1 ; A 1 = a1
a1
a2
¯
¸
¯
² Stage 1 Q-function Q1 (X 1 ; A 1 ) = E maxa 2 Q2 (X 1 ; A 1 ; X 2 ; a2 ) ¯¯X 1 ; A 1
·
·
¸
V opt = E max Q1 (X 1 ; a1 )
a1
17
Simple Version of Q-Learning –
There is a regression for each stage.
• Stage 2 regression: Regress Y on
obtain
to
• Stage 1 regression: Regress
obtain
to
on
18
for subjects entering stage 2:
•
is a predictor of maxa2 Q2 (X 1 ; A 1 ; X 2; a2 )
•
is the predicted end of stage 2 response when the
stage 2 treatment is equal to the “best” treatment.
•
is the dependent variable in the stage 1 regression
for patients moving to stage 2
19
A Simple Version of Q-Learning –
• Stage 2 regression, (using Y as dependent variable)
yields
• Arg-max over a2 yields
20
A Simple Version of Q-Learning –
• Stage 1 regression, (using
yields
as dependent variable)
• Arg-max over a1 yields
21
Confidence Intervals
^
Y
T
^
+ max ¯2 S2 a2
=
T 0
®
^ 2 S2
=
®
^ 2T S20 + j ¯^2T S2 j
a2
22
Non-regularity
•
^
Y
=
®
^ 2T S20 + j ¯^2T S2 j
• Limiting distribution of
(Robins, 2004)
p
n( ¯^1 ¡ ¯1 ) is non-regular
• Problematic area in parameter space is around ¯2
for which P[¯ T S ¼ 0] > 0
2
2
• Standard asymptotic approaches invalid without
modification (see Shao, 1994; Andrews, 2000).
23
Non-regularity –
•
p ^
Problematic term in n( ¯1 ¡ ¯1 ) is
·
T
c
§^ ¡1 1 Pn
where B 1 =
¡
¡p T
p T
B 1 j n¯^2 S2 )j ¡ j n¯2 S2 j
¸
¢
¢
0 T
T T
(S1 ) ; A 1 S1
• This term is well-behaved if ¯2T S2 is bounded away
from zero.
• We want to form an adaptive confidence interval.
24
Idea from Econometrics
• In nonregular settings in Econometrics there are a
fixed number of easily identified “bad” parameter
values at which you have nonregular behavior of the
estimator
• Use a pretest (e.g. an hypothesis test) to test if you
are near a “bad” parameter value; if the pretest
rejects, use standard critical values to form
confidence interval; if the pretest accepts, use the
maximal critical value over all possible local
alternatives. (Andrews & Soares, 2007; Andrews &
Guggenberger, 2009)
25
Our Approach
• Construct smooth upper and lower bounds so that for
p
all n
T
Ln · c
n( ¯^1 ¡ ¯1 ) · Un
• The upper/lower bounds use a pretest:
p
• Embed in the formula for n( ¯^1 ¡ ¯1 ) a pretest of
H0 : sT2 ¯2 = 0
based on
Tn (s2 ) =
2
T ^
n s2 ¯ 2
sT2 §^ 2 s 2
(
)
26
Non-regularity
• The upper bound adds to c
T
¡
+
p
n( ¯^1 ¡ ¯1 ):
¡ 1
^
c § 1 Pn B 1 Z n (b)1T n ( S2 ) ·
T
¯
¯
¸n¯
¡ 1
^
sup c § 1 Pn B 1 Z n (b)1T n ( S2 ) ·
T
b
b=
¸
p
n ¯2
n
27
Confidence Interval
• Let Un( b) and L (nb) be the bootstrap analogues of Un
and L n
• PM is the probability with respect to the bootstrap
weights. ^l is the ±=2 quantile of L (nb) ; u
^ is the 1 ¡ ±=2
( b)
quantile of Un .
Theorem: Assume moment conditions, invertible varcovariance matrices and ¸ n = loglogn, then
³
PM
´
p
p
cT ¯^1 ¡ u
^= n · cT ¯1¤ · cT ¯^1 ¡ ^l= n ¸ 1 ¡ ± + oP (1)
28
Adaptation Theorem: Assuming finite moment
conditions and invertible var-covariance
matrices are invertible, ¸ n ! 1 ; ¸ n = o(n) ,
P[S2T ¯2 6
= 0] = 1
then
T
Ln; c
p
(b)
(b)
^
n( ¯1 ¡ ¯1 ); Un ; L n ; Un
each converge, in distribution, to the same
limiting distribution (the last two in
probability).
29
Empirical Study
• Some Competing Methods
• Soft-thresholding (ST) Chakraborty et al. (2009)
• Centered percentile bootstrap (CPB)
• Plug-in pretesting estimator (PPE)
• Generative Models
• Nonregular (NR): P[S2T ¯2
•
= 0] > 0
Nearly Nonregular (NNR):P[ST ¯2 ¼ 0] > 0
2
• Regular (R): P[S2T ¯2
= 0] = 0
• n=150, 1000 Monte Carlo Reps, 1000 Bootstrap Samples30
Example –Two Stages, two treatments per stage
Type
CPB
ST
PPE
ACI
NR
.93*
.95 (.34)
.93*
.99(.50)
R
.94(.45)
.92*
.93*
.95(.48)
NR
.93*
.76*
.90*
.96(.49)
NNR
.93*
.76*
.90*
.97(.49)
Size(width)
31
Example –Two Stages, three treatments in stage 2
Type
CPB
PPE
ACI
NR
.93*
.93*
1.0(.72)
R
.94(.56)
.92*
.96(.63)
NR
.89*
.86*
.97(.67)
NNR
.90*
.86*
.97(.67)
Size(width)
32
Pelham’s ADHD Study
A1. Continue, reassess monthly;
randomize if deteriorate
Yes
8 weeks
A. Begin low-intensity
behavior modification
A2. Add medication;
bemod remains stable but
medication dose may vary
AssessAdequate response?
No
Random
assignment:
Random
assignment:
A3. Increase intensity of bemod
with adaptive modifications based on impairment
B1. Continue, reassess monthly;
randomize if deteriorate
8 weeks
B. Begin low dose
medication
AssessAdequate response?
No
Random
assignment:
B2. Increase dose of medication
with monthly changes
as needed
B3. Add behavioral
treatment; medication dose
remains stable but intensity
of bemod may increase
with adaptive modifications
33
based on impairment
ADHD Example
• (X1, A1, R1, X2, A2, Y)
– Y = end of year school performance
– R1=1 if responder; =0 if non-responder
– X2 includes the month of non-response, M2, and a
measure of adherence in stage 1 (S2 )
– S2 =1 if adherent in stage 1; =0, if non-adherent
– X1 includes baseline school performance, Y0 ,
whether medicated in prior year (S1), ODD (O1)
– S1 =1 if medicated in prior year; =0, otherwise.
34
ADHD Example
• Stage 2 regression for Y:
(1; Y0 ; S1 ; O1 ; A 1 ; M 2 ; S2 )®2 +
A 2 (¯21 + A 1 ¯22 + S2 ¯23 )
• Stage 1 outcome: R1 Y + (1 ¡ R1 ) Y^
^ = (1; Y0 ; S1 ; O1 ; A 1 ; M 2 ; S2 ) ®
Y
^2+
j ¯^21 + A 1 ¯^22 + S2 ¯^23 j
35
ADHD Example
• Stage 1 regression for
(1; Y0 ; S1 ; O1 )®1 + A 1 (¯11 + S1 ¯12 )
• Interesting stage 1 contrast: is it important to
know whether the child was medicated in the
prior year (S1=1) to determine the best initial
treatment in the sequence?
36
ADHD Example
• Stage 1 treatment effect when S1=1: ¯11 + ¯12
• Stage 1 treatment effect when S1=0:
¯11
90% ACI
¯11 + ¯12
¯11
(-0.48, 0.16)
(-0.05, 0.39)
37
Dynamic Treatment Regime
Proposal
IF medication was not used in the prior year
THEN begin with BMOD;
ELSE select either BMOD or MED.
IF the child is nonresponsive and was nonadherent, THEN augment present treatment;
ELSE IF the child is nonresponsive and was
adherent, THEN select intensification of
current treatment.
38
38
Challenges
• There are multiple ways to form
^ ; what are the
Y
pros and cons?
•Improve adaptation by a pretest of
H0 : ¯2 = 0?
• High dimensional data; investigators want to
collect real time data
• Feature construction & Feature selection
•Many stages or infinite horizon
39
This seminar can be found at:
http://www.stat.lsa.umich.edu/~samurphy/
seminars/Bristol04.10.12.ppt
Email Eric Laber or me for questions:
laber@stat.ncsu.edu or samurphy@umich.edu.edu
40
Download