LECTURE 21

Subgroup Analysis
“Fun to look at but don’t believe
them!” (P Sleight, 2000)
Deciding on analysis after looking
at the data is “dangerous, useful,
and often done.” (IJ Good, 1983)
“We cannot necessarily, perhaps very
rarely, pass from (the overall result of a
trial) to stating exactly what effect the
treatment will have on a particular patient.
But there is surely, no way and no method
of deciding that.”
A. Bradford Hill, 1952
“Several factors are threatening to create a
path to “depersonalized” medicine despite
advances both in fundamental science and
clinical therapeutics. The tendency to
focus on statistics for the group rather
than the individual clinical features of the
patient is one factor.”
Horwitz RI, et al, (De)Personalized
Medicine, Science, 8 March 2013
Most trials report subgroup analyses
(median=4 subgroups)
Assmann SF, Lancet 2000; 355:1064-1069
Influence of Study Characteristics on
Reporting of Subgroups
• 44% of 469 randomized trials published in
major journals reported subgroup analyses
• Subgroup analyses were more likely to be
reported in high impact journals, non-surgical
trials, and large trials.
• There was an interaction between source of
funding and reporting of subgroups in trials
without significant overall results – a subgroup
finding!
Sun et al, BMJ, 2011
Conclusion: Sun et al, BMJ 2011
“Industry funded randomised controlled
trials, in the absence of statistically
significant primary outcomes, are more
likely to report subgroup analyses than
non-industry funded trials. Industry
funded trials less frequently test for
interaction than non-industry funded
trials. Subgroup analyses from industry
funded trials with negative results for
the primary outcome should be viewed
with caution.”
Aims of Subgroup Analysis
• To show consistency of trial findings for major
endpoints for important patient subsets
• To assess whether there are large differences
in the treatment effect among different types of
patients and, if so, identify hypotheses for
future research. (Assess the possibility of
treatment X subgroup or covariate
interactions)
Aim should not be to salvage a trial for which the overall
results were not as hoped for!
Subgroup Analysis by Astrological Birth Sign
ISIS-2: Streptokinase and Aspirin for Acute MI

                     Percentage Reduction in
Birth sign           5 Week Vascular Mortality
Gemini or Libra        9%    (NS)
Other signs          -28%    (p < 0.00001)
Overall              -23%    (p < 0.00001)
“Lack of evidence of benefit just in one particular
subgroup is not good evidence of lack of benefit.”
Subgrouping Considerations
• Most trials are not designed to look at subgroups; sample
size and power based on overall treatment effect (power
is lower for subgroups than overall comparison).
• For subgroup analysis, it is often not clear how to control
for type 1 error (the more subgroups examined, the
greater the risk of a type 1 error).
• Not all subgroups of interest can be pre-specified (we are
not that smart).
• The subgroup may not be what it appears to be (it may be
a marker or label for some other characteristic).
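The type 1 error point above can be made concrete with a small calculation. The sketch below assumes the k subgroup tests are independent and each is run at the same level alpha (an idealization, since real subgroup tests are usually correlated); the function names are illustrative only.

```python
def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among k independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_alpha(k: int, alpha: float = 0.05) -> float:
    """Per-test level that keeps the family-wise error rate at or below alpha."""
    return alpha / k


# Assumed numbers of subgroup tests, for illustration only
for k in (1, 4, 12):
    print(k, round(familywise_error_rate(k), 3), round(bonferroni_alpha(k), 4))
```

Under these assumptions, 4 tests give roughly a 19% chance of at least one spurious "significant" subgroup and 12 tests give roughly 46%.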
Subgroup Definitions
• Proper subgroup – grouping of patients
according to baseline characteristics
• Improper subgroup – grouping of patients
according to characteristics following
randomization (i.e., factors potentially affected
by treatment)
• Interaction – evidence that treatment effects
differ by subgroup (quantitative versus
qualitative)
Yusuf S, et al., JAMA, 266:93-98, 1991.
A Priori and A Posteriori Subgroups
• A priori: written in the protocol in advance of the
study (hypothesis driven)
• A posteriori (post hoc or exploratory):
- specified … later
- before unblinding
- after unblinding
• Both have inflated error rates, but more of a
problem with a posteriori defined subgroups.
INSIGHT START Protocol: Early Treatment
for HIV
“Subgroup analyses for the primary endpoint
and major secondary outcomes will be
performed to determine whether the treatment
effect (early versus deferred) differs
qualitatively across various baseline-defined
subgroups. Subgroup analysis will be
performed by age, gender, race/ethnicity,
geographic region, the presence of risk factors
for serious non-AIDS conditions, baseline
CD4+ cell count, baseline HIV RNA level,
calendar date of enrollment to assess the
effect of different treatment patterns that may
emerge, and the ART-regimen pre-specified at
the time of randomization….An overall test of
heterogeneity will provide evidence of whether
the magnitude of the treatment difference
varies across baseline subgroups.”
Pre- and Post-Stratification
and Subgroup Analysis
• Pre-stratification variables are often, but not
always, subgroups of interest.
• Aim of post-stratified analysis is to obtain a
“better” estimate of overall treatment effect.
• Aim of subgroup analysis is to determine
whether treatment differences are consistent.
• Like post-stratification, plans for subgroup
analysis should be pre-specified; sometimes
there are surprises.
Subgrouping vs. Stratification

Grouping              Purpose
Pre-stratification    “insurance” for balance in randomization
Post-stratification   increase the accuracy of estimates of treatment effect
Subgroups             check the consistency of the treatment effect
Stratified Design for Comparing Treatments

              Treatment
Stratum       A        B        Total
1             m1A      m1B      m1
2             m2A      m2B      m2
3             m3A      m3B      m3
4             m4A      m4B      m4
Total         na       nb

• Typical situation: m1 ≠ m2 ≠ m3 ≠ m4
• Study is designed/powered based on na and nb
• Goal: miA = miB for all i.
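One common way to approach the goal miA = miB within every stratum is permuted-block randomization carried out separately in each stratum. The slide does not say which allocation method is used, so the sketch below is an assumption: the function name, the block size of 4, and the stratum sizes are all hypothetical.

```python
import random
from collections import Counter


def stratified_block_randomization(stratum_sizes, block_size=4, seed=0):
    """Assign arms A/B within each stratum using permuted blocks.

    stratum_sizes: dict mapping stratum label -> number of participants (m_i).
    block_size must be even. Returns stratum label -> list of arm labels.
    """
    rng = random.Random(seed)  # fixed seed only so the illustration is reproducible
    half = block_size // 2
    assignments = {}
    for stratum, m in stratum_sizes.items():
        arms = []
        while len(arms) < m:
            block = ["A"] * half + ["B"] * half
            rng.shuffle(block)  # random order within a balanced block
            arms.extend(block)
        assignments[stratum] = arms[:m]  # a final partial block may be slightly unbalanced
    return assignments


# Hypothetical stratum sizes with m1 != m2 != m3 != m4
sizes = {1: 40, 2: 24, 3: 16, 4: 8}
allocation = stratified_block_randomization(sizes)
for stratum, arms in allocation.items():
    counts = Counter(arms)
    print(stratum, counts["A"], counts["B"])
```

Because each complete block holds equal numbers of A and B, the arms are balanced within every stratum whenever the stratum size is a multiple of the block size, even though the strata themselves differ in size.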
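Subgrouping Factors Determined Experimentally
[Figure: a 2 x 2 factorial design, in which the subgrouping factor (B vs. No B) is determined by randomization, versus a design in which the subgrouping factor (B vs. No B) is a baseline characteristic; in both, treatment is A vs. No A.]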
NIH Policy on Subgroups
“When an NIH-defined Phase III clinical trial is
proposed, evidence must be reviewed to show
whether or not clinically important sex/gender
and race/ethnicity differences in the intervention
effect are to be expected.”
“Inclusion of the results of sex/gender,
race/ethnicity and relevant subpopulations
analyses is strongly encouraged in all
publication submissions.”
http://grants.nih.gov/grants/funding/women_min/guidelines_amended_10_2001.htm
ICH Guidelines on Subgroups
• If the size of the study permits, important
demographic or baseline value-defined
subgroups should be examined.
• These analyses are not intended to “salvage”
an otherwise unsupportive study.
• Subgroup analyses may suggest hypotheses
to be examined in other studies
• If there is a prior hypothesis about a subgroup,
this should be part of the statistical analysis
plan.
Issues to Consider
• Appropriate significance level? Bonferroni method
may be too conservative – loss of power in a
situation where power is already low.
• Should subgroup analysis be performed if the overall
result is negative? Much harder sell.
• Should only a priori subgroups be described? Not
always that smart.
• How should subgroup analyses be presented?
Interaction tests important.
• Should analyses be based on post-randomization
measures? No
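The remark that power is already low for subgroups can be illustrated with a normal-approximation calculation. The sketch below rests on several assumptions not taken from the slides: a trial designed for 90% power on the overall comparison at two-sided alpha = 0.05, two equal-sized subgroups, and an interaction whose true size equals the overall effect. The function name is illustrative.

```python
from statistics import NormalDist

_norm = NormalDist()


def two_sided_power(effect: float, se: float, alpha: float = 0.05) -> float:
    """Power of a two-sided z-test when the true effect is `effect` and its estimate has standard error `se`."""
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    shift = effect / se
    return _norm.cdf(shift - z_crit) + _norm.cdf(-shift - z_crit)


# Hypothetical design: the overall comparison has 90% power at alpha = 0.05.
effect = 1.0
se_overall = effect / (_norm.inv_cdf(0.975) + _norm.inv_cdf(0.90))

# One of two equal-sized subgroups: half the patients, so the SE grows by sqrt(2).
se_subgroup = se_overall * 2 ** 0.5
# Interaction = difference of the two subgroup effects: its SE is twice the overall SE.
se_interaction = se_overall * 2

power_overall = two_sided_power(effect, se_overall)
power_subgroup = two_sided_power(effect, se_subgroup)
power_interaction = two_sided_power(effect, se_interaction)
print(round(power_overall, 2), round(power_subgroup, 2), round(power_interaction, 2))
```

Under these assumptions the power falls from about 90% overall to about 63% within one subgroup and about 37% for the interaction test, before any Bonferroni-type adjustment lowers it further.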
A Consumer’s (and Producer’s?)
Guide to Subgroup Analysis
• Document heterogeneity between subgroups
• Argue consistency with biologic phenomena
• Argue consistency with other data from the
trial
• Argue consistency with other studies
• It is easy to build a story after the fact!
Data from Neonatal Hypocalcemia Trial:
All Calcium Levels in mmol/l

                     Breast-fed                Bottle-fed
                     Supplement   Placebo      Supplement   Placebo
Treatment mean       2.445        2.408        2.300        2.195
No. babies           64           102          169          285
SE                   0.0365       0.0311       0.0211       0.0189
Treatment effect            0.037                     0.105
SE                          0.0480                    0.0283
P-value                     0.44                      0.0002

Reference: Cockburn et al, BMJ, 281:11-14; 1980.
See also Pocock, Clinical Trials: A Practical Approach.
Data from Neonatal Hypocalcemia Trial (cont.)

$$Z = \frac{0.037 - 0.105}{\left(0.0365^2 + 0.0311^2 + 0.0211^2 + 0.0189^2\right)^{1/2}} = \frac{-0.068}{0.0557} = -1.22$$

P-value = 0.22
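The interaction test above can be reproduced from the subgroup means and standard errors on the previous slide. This is a minimal sketch that assumes a two-sided normal (z) test; the variable and function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Subgroup summaries from the neonatal hypocalcemia slide (calcium in mmol/l)
breast = {"mean_s": 2.445, "se_s": 0.0365, "mean_p": 2.408, "se_p": 0.0311}
bottle = {"mean_s": 2.300, "se_s": 0.0211, "mean_p": 2.195, "se_p": 0.0189}


def effect_and_se(group):
    """Supplement-minus-placebo difference and its standard error."""
    diff = group["mean_s"] - group["mean_p"]
    se = sqrt(group["se_s"] ** 2 + group["se_p"] ** 2)
    return diff, se


d1, se1 = effect_and_se(breast)
d2, se2 = effect_and_se(bottle)

# Interaction: difference between the two subgroup treatment effects
z = (d1 - d2) / sqrt(se1 ** 2 + se2 ** 2)
p = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 2), round(p, 2))
```

Although the bottle-fed effect is highly significant on its own (P = 0.0002) and the breast-fed effect is not (P = 0.44), the test of whether the two effects differ gives Z of about -1.22 and P of about 0.22, matching the slide.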
HDFP Study
Race, Sex, Age

                  Deaths           Percent Difference
Subgroup          SC       RC      in Mortality
Black men         112      140     -18.5
Black women        70       98     -27.8
White men         109      126     -14.7
White women        58       55      +2.1
Age 30-49          81       82      -5.7
Age 50-59         115      159     -25.3
Age 60-69         153      178     -16.4
Overall           349      419     -16.9
HDFP Subgroups

Subgroup (i)         Group   Dead   Alive   $\hat{O}_i$   $w_i$
Black Men (1)        SC      112     952    0.79          55.0
                     RC      140     944
Black Women (2)      SC       70    1274    0.70          38.3
                     RC       98    1256
White Men (3)        SC      109    1783    0.84          54.8
                     RC      126    1735
White Women (4)      SC       58    1026    1.13          26.8
                     RC       55    1101
$$\sum_{i=1}^{4} w_i = 174.9$$

$$\log \hat{O}_p = \frac{(55.0)\log(0.79) + (38.3)\log(0.70) + (54.8)\log(0.84) + (26.8)\log(1.13)}{174.9}$$

$$\log \hat{O}_p = -0.188, \qquad \hat{O}_p = 0.83$$

$$X^2_{(3)} = \sum_{i=1}^{4} w_i \left(\log \hat{O}_i - \log \hat{O}_p\right)^2 = 0.134 + 1.111 + 0.008 + 2.551 = 3.804; \qquad p = 0.28$$
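The pooled odds ratio and homogeneity statistic can be recomputed from the four 2 x 2 tables. The sketch below assumes the weights are inverse-variance (Woolf) weights, 1/(1/a + 1/b + 1/c + 1/d); the slides do not name the method, but this choice reproduces the weights shown (55.0, 38.3, 54.8, 26.8) to within rounding. Function names are illustrative.

```python
from math import erfc, exp, log, pi, sqrt

# (dead, alive) for SC and RC in each HDFP race-sex subgroup, from the slide
tables = {
    "Black men":   ((112, 952), (140, 944)),
    "Black women": ((70, 1274), (98, 1256)),
    "White men":   ((109, 1783), (126, 1735)),
    "White women": ((58, 1026), (55, 1101)),
}


def log_or_and_weight(sc, rc):
    """Log odds ratio (SC vs. RC) and its assumed inverse-variance (Woolf) weight."""
    a, b = sc
    c, d = rc
    return log((a * d) / (b * c)), 1.0 / (1 / a + 1 / b + 1 / c + 1 / d)


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)


stats = [log_or_and_weight(sc, rc) for sc, rc in tables.values()]
total_w = sum(w for _, w in stats)
log_pooled = sum(w * l for l, w in stats) / total_w

# Homogeneity statistic: weighted squared deviations from the pooled log odds ratio
x2 = sum(w * (l - log_pooled) ** 2 for l, w in stats)

pooled_or = exp(log_pooled)
p_value = chi2_sf_3df(x2)
print(round(total_w, 1), round(pooled_or, 2), round(x2, 2), round(p_value, 2))
```

Working from unrounded odds ratios gives a pooled odds ratio of about 0.83 and a homogeneity statistic slightly below the slide's 3.804, which was computed from rounded inputs; the conclusion of no clear evidence of heterogeneity is the same.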
Aspirin and Risk of Stroke or Death
               Men                              Women
               Event   No Event                 Event   No Event
Aspirin         29      171        Aspirin       17       73
No Aspirin      56      150        No Aspirin    12       77
Among men, aspirin reduced the risk of
stroke or death by 48% (p=0.004); among women, aspirin
increased the risk of stroke or death by 42% (p=0.35).
Overall, aspirin reduced the risk of stroke or death
by 31% (p=0.05).
Conclusion: “We conclude that aspirin is an efficacious
drug for men with threatened stroke”.
N Engl J Med 1978; 299:53-59.
[Figure: Absolute effects of antiplatelet therapy on vascular events in the 29 trials in high risk patients with separate information available on each patient, subdivided by age and sex and by diastolic blood pressure and diabetes. Slide annotation: no difference by gender.]
Antiplatelet Trialists' Collaboration, BMJ 1994;308:81-106
Cox Model for Interaction
• Treatment x gender interaction
Z1 = 1 if eplerenone; 0 if placebo
Z2 = 1 if male; 0 if female
Z3 = Z1 x Z2
H0: β3 = 0
h(t; Z) = h0(t) exp[β1 Z1 + β2 Z2 + β3 Z3]
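To see why H0: β3 = 0 is the test of interaction, note that the model implies a treatment hazard ratio of exp(β1) for women and exp(β1 + β3) for men. The sketch below uses hypothetical coefficient values chosen only for illustration (they are not estimates from any trial), and the function name is an assumption.

```python
from math import exp


def treatment_hazard_ratio(beta1: float, beta3: float, male: int) -> float:
    """Hazard ratio for eplerenone vs. placebo within a gender stratum.

    Under h(t; Z) = h0(t) exp[b1*Z1 + b2*Z2 + b3*Z3] with Z3 = Z1*Z2,
    h0(t) and the b2 term cancel in the ratio, leaving exp(b1 + b3*Z2).
    """
    return exp(beta1 + beta3 * male)


# Hypothetical coefficients, for illustration only
beta1, beta3 = -0.20, 0.10

hr_female = treatment_hazard_ratio(beta1, beta3, male=0)
hr_male = treatment_hazard_ratio(beta1, beta3, male=1)

# The ratio of the two subgroup hazard ratios is exp(beta3); it equals 1 exactly when beta3 = 0.
print(round(hr_female, 3), round(hr_male, 3), round(hr_male / hr_female, 3))
```

The test of β3 therefore compares the two subgroup hazard ratios directly, rather than comparing each subgroup's own p-value.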
Subgroup Analyses According to
Follow-up Time
• Heart and Estrogen/progestin
Replacement Study (HERS)
– JAMA 1998; 280: 605-613.
• Adenomatous Polyp Prevention on
Vioxx (APPROVe) Trial
– N Engl J Med 2005; 352:1092-1102
– Lancet 2008; 372:1756-1764.
HERS

                       Estrogen-Progestin   Placebo     Hazard Ratio
                       (n=1380)             (n=1383)    (95% CI)
Primary CHD events     172                  176         0.99 (0.80 – 1.22)
  Year 1                57                   38         1.52
  Year 2                47                   48         1.00
  Year 3                35                   41         0.87
  Year 4                33                   49         0.67

P=.009 for interaction
APPROVe

                              Rofecoxib   Placebo     Hazard ratio
                              (n=1287)    (n=1299)    (95% CI)
Confirmed thrombotic events   46          26          1.92 (1.19 – 3.11)
  Months 0-18                 22          20          1.18
  Months 19-36                24           6          4.45

P=.01 for the test of proportional hazards (interaction); the proportional hazards assumption failed.
Later determined that a different test for interaction was
pre-specified and inclusion of events after treatment
discontinuation changed findings.
Barrett-Connor on HERS*
A Fable: Looking for the Pony
A man has 2 sons, one a hopeless pessimist and
the other an unrealistic optimist. Determined to
change their thinking to a less extreme position,
the man buys a room full of toys for the
pessimist and a room full of horse manure for
the optimist.
When he returns, the pessimist is crying because
he has broken all of his toys. In contrast, the
optimist is shoveling through his gift and
proclaims: “with all that manure there must be a
pony in there somewhere.”
Circulation 2002;105:902-903.
“New Study Reassures Most Users of Hormones.
For Newly Menopausal, There’s No Heart Risk; A
Reversal of Findings.”
“At Issue is something called the P value…”
Wall Street Journal
April 4, 2007
Cardiovascular and Global Index Events by Years
Since Menopause at Baseline (WHI Study)

                  Years Since   No. of Cases                 HR                  P value
Outcome           Menopause     Hormone Therapy   Placebo    (95% CI)            for Trend†
CHD‡              <10            39                51        0.76 (0.50-1.16)    .02
                  10-19         113               103        1.10 (0.84-1.45)
                  ≥20           194               158        1.28 (1.03-1.58)
Stroke            <10            41                23        1.77 (1.05-2.98)    .36
                  10-19         100                79        1.23 (0.92-1.66)
                  ≥20           142               113        1.26 (0.98-1.62)
Total Mortality   <10            53                67        0.76 (0.53-1.09)    .51
                  10-19         142               149        0.98 (0.78-1.24)
                  ≥20           267               240        1.14 (0.96-1.36)
Global Index§     <10           222               203        1.05 (0.86-1.27)    .62
                  10-19         482               440        1.12 (0.98-1.27)
                  ≥20           675               632        1.09 (0.98-1.22)

Group sizes as shown (hormone therapy, placebo): <10: n=3608, n=3529; 10-19: n=4483, n=3529; ≥20: n=3608, n=3529

† Test for trend (interaction) using years since menopause as continuous (linear) form of categorical
coded values. Cox regression models stratified according to active vs. placebo and trial, including
terms for years since menopause and the interaction between trials and years since menopause
JAMA 2007;297:1465-1477
CHD Events by Years Since Menopause at Baseline

                     Years Since Menopause                                      P value
                     <10                10-19              ≥20                 for Trend†
CHD‡ HR (95% CI)     0.76 (0.50-1.16)   1.10 (0.84-1.45)   1.28 (1.03-1.58)    .02

“These analyses, although not definitive, suggest that the
health consequences of hormone therapy may vary
by distance from menopause…”
AIDS Vaccine Trial
(Science 28 February 2003)

            Infected   Not Infected   Total
Vaccine     191        3,139          3,330
Placebo      98        1,581          1,679
Total       289        4,720          5,009

5.7% vs. 5.8%
$\widehat{OR} = 0.98$, 95% CI (0.78 to 1.24)
AIDS Vaccine Trial
Subgroup Analysis

White and Hispanic
            Infected   Not Infected
Vaccine     179        2,824
Placebo      81        1,427
6.0 vs. 5.4%
$\widehat{OR}_1 = 1.12$ (95% CI: 0.85 to 1.46)

Black, Asian, Other
            Infected   Not Infected
Vaccine      12        315
Placebo      17        154
3.7 vs. 9.9%
$\widehat{OR}_2 = 0.35$ (95% CI: 0.16 to 0.74)

$\hat{O}_p = 1.02$; $\chi^2_1 = 8.6$ for homogeneity of odds ratio;
p = 0.003
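The subgroup odds ratios and their intervals can be checked directly from the 2 x 2 counts. The sketch below assumes a Woolf (log-scale, normal-approximation) confidence interval; the slides do not state which interval method was used, and the function name is illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio for a 2x2 table [[a, b], [c, d]] with a Woolf (log-scale) confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    log_or = log((a * d) / (b * c))
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return exp(log_or), exp(log_or - z * se), exp(log_or + z * se)


# Rows: vaccine, placebo; columns: infected, not infected (counts from the slide)
white_hispanic = odds_ratio_ci(179, 2824, 81, 1427)
black_asian_other = odds_ratio_ci(12, 315, 17, 154)

for label, (or_, lo, hi) in [("White and Hispanic", white_hispanic),
                             ("Black, Asian, Other", black_asian_other)]:
    print(f"{label}: OR = {or_:.2f} (95% CI: {lo:.2f} to {hi:.2f})")
```

Under this assumption both subgroup intervals agree with the slide to two decimals. The much wider interval in the second subgroup reflects its small number of infections (12 and 17), which is worth keeping in mind when reading the homogeneity p-value.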
Example: ACTG 155

Arms          Randomization (allocation ratio)
AZT           2
ddC           2
AZT + ddC     3

Primary outcome: disease progression (AIDS/death)
Secondary outcome: CD4+ cell count change, toxicities
Sample Size: 991

Subgrouping     Number
CD4<50          269
50≤CD4<150      336
CD4≥150         386
“We found no overall benefits of
zalcitabine used alone or with zidovudine.
However, a trend analysis suggested a
better outcome for combination therapy
compared with zidovudine as the
pretreatment CD4 cell count increased”.
“Our study suggests that combination
therapy may be beneficial in patients with
higher CD4 cell counts”.
Pooled Analysis of AZT + ddX vs. AZT
Treatment Naïve Patients

Baseline      No. AIDS/Death
CD4+          Events            Hazard Ratio*
< 100         382               0.66 (0.53 - 0.82)
100 - 199     319               0.63 (0.50 - 0.81)
200 - 299     186               0.62 (0.45 - 0.84)
300 - 499      90               0.63 (0.40 - 0.98)

*AZT + ddX vs. AZT
Some Lessons From ACTG 155 Presentation
1. What does “a priori” mean?
If it is important, amend the protocol.
2. Confusion about stratification and
subgrouping.
Lessons Continued
3. It is easy to develop explanations for possible
subgroup effects.
4. By chance some subgroups will be more
extreme than others.
Lessons Continued
5. For an ordered/continuous variable, test for
trend is important.
CD4+ categories: < 50, 50 - 149, 150+
4 df test for interaction (3 treatment groups and 3 CD4
categories) or
2 df test (3 treatment groups and continuous CD4)
6. “Subgroup label” may be a marker for
something else.
Guidelines to Follow for Interpreting
Subgroup Analysis
• Assess magnitude of interaction before focusing on
separate subgroups and their tests of significance
• Assess consistency with biologic phenomenon
realizing that “human imagination is capable of
developing a rationale for most findings” (Ware,
NEJM, 2003).
• Assess consistency with other data from trial
• Assess consistency with other studies
Guidelines For Reporting Subgroup
Analyses (NEJM 2007;2189-2194)
• Abstract: Only if based on primary outcome and pre-specified
• Methods: Number pre-specified; any of special
interest; endpoint; methods used to assess
heterogeneity; number performed; potential effect on
type 1 error
• Results: present tests of heterogeneity; forest plot
• Discussion: Cautious in interpretation; state
limitations; cite supporting or contradictory data
Criteria Used to Assess Credibility of
Subgroup Effect (BMJ 2012;344:e1553)
• Design
– Baseline characteristic?
– Stratification factor?
– A priori specified?
– Fewer than 5 subgroups tested?
• Analysis
– Test for interaction performed?
– If multiple interactions, independent?
• Context
– Direction correctly pre-specified?
– Consistent with evidence from previous studies?
– Consistent across outcomes?
– Indirect evidence (e.g., biologic rationale) supports
finding?
Methods Section of ESPRIT Paper
N Engl J Med 2009; 361: p. 1550
“Data on the primary end point were
summarized for pre-specified subgroups
defined according to baseline characteristics.
A total of 12 subgroup analyses were prespecified. The heterogeneity of hazard-ratio
estimates between subgroups were assessed
by including an interaction term between
treatment and subgroup in expanded Cox
models. The results of subgroup analyses
should be interpreted with caution; a
significant interaction could be due to chance,
because there was no adjustment made to type
1 error for the number of subgroups
examined.”
5 subgroups were reported: age, gender, race/ethnicity,
baseline CD4+ count and baseline HIV RNA level.
Summary
• P-values for individual subgroups are misleading –
report CIs.
• Calculate subgroup by treatment interactions, but be
cognizant of low power and risks of type 1 error if
multiple subgroups are examined
• Keep in mind most trials are designed assuming no
interaction.
• Define key subgroups to be investigated in the
protocol.
• Report subgroup findings very cautiously – ultimately
want validation in another study or meta-analysis.
“Only one thing is worse than doing subgroup
analyses --- believing the results.” Richard Peto