Subgroup Analysis

“Fun to look at but don’t believe them!” (P Sleight, 2000)

Deciding on analysis after looking at the data is “dangerous, useful, and often done.” (IJ Good, 1983)

“We cannot necessarily, perhaps very rarely, pass from (the overall result of a trial) to stating exactly what effect the treatment will have on a particular patient. But there is surely, no way and no method of deciding that.” (A. Bradford Hill, 1952)

“Several factors are threatening to create a path to “depersonalized” medicine despite advances both in fundamental science and clinical therapeutics. The tendency to focus on statistics for the group rather than the individual clinical features of the patient is one factor.” (Horwitz RI, et al., (De)Personalized Medicine, Science, 8 March 2013)

Most trials report subgroup analyses (median = 4 subgroups)
Assmann SF, Lancet 2000; 355:1064-1069

Influence of Study Characteristics on Reporting of Subgroups
• 44% of 469 randomized trials published in major journals reported subgroup analyses.
• Subgroup analyses were more likely to be reported in high impact journals, non-surgical trials, and large trials.
• There was an interaction between source of funding and reporting of subgroups in trials without significant overall results – a subgroup finding!
Sun et al, BMJ, 2011

Conclusion: Sun et al, BMJ 2011
“Industry funded randomised controlled trials, in the absence of statistically significant primary outcomes, are more likely to report subgroup analyses than non-industry funded trials. Industry funded trials less frequently test for interaction than non-industry funded trials.
Subgroup analyses from industry funded trials with negative results for the primary outcome should be viewed with caution.”

Aims of Subgroup Analysis
• To show consistency of trial findings for major endpoints for important patient subsets
• To assess whether there are large differences in the treatment effect among different types of patients and, if so, identify hypotheses for future research. (Assess the possibility of treatment × subgroup or covariate interactions.)
Aim should not be to salvage a trial for which the overall results were not as hoped for!

Subgroup Analysis by Astrological Birth Sign
ISIS-2: Streptokinase and Aspirin for Acute MI
Percentage Change in 5 Week Vascular Mortality

  Gemini or Libra    +9%   (NS)
  Other signs       -28%   (p < 0.00001)
  Overall           -23%   (p < 0.00001)

“Lack of evidence of benefit just in one particular subgroup is not good evidence of lack of benefit.”

Subgrouping Considerations
• Most trials are not designed to look at subgroups; sample size and power are based on the overall treatment effect (power is lower for subgroups than for the overall comparison).
• For subgroup analysis, it is often not clear how to control for type 1 error (the more subgroups examined, the greater the risk of a type 1 error).
• Not all subgroups of interest can be pre-specified (we are not that smart).
• The subgroup may not be what it appears to be (it may be a marker or label for some other characteristic).

Subgroup Definitions
• Proper subgroup – grouping of patients according to baseline characteristics
• Improper subgroup – grouping of patients according to characteristics following randomization (i.e., factors potentially affected by treatment)
• Interaction – evidence that treatment effects differ by subgroup (quantitative: the effect differs in size but not direction; qualitative: the effect differs in direction)
Yusuf S, et al., JAMA, 266:93-98, 1991.
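The type 1 error point under Subgrouping Considerations can be made concrete with a small calculation. The sketch below assumes k independent subgroup tests, each at α = 0.05, with no true effect anywhere; this is an idealization, since real subgroups overlap and their tests are correlated. The function names are illustrative, and the Bonferroni threshold is included only as one common correction.

```python
def familywise_error(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_alpha(k, alpha=0.05):
    """Per-test significance level that keeps the familywise error at or below alpha."""
    return alpha / k


# Assumed numbers of subgroup tests, chosen only for illustration.
for k in (1, 4, 10, 20):
    print(f"{k:2d} tests: P(any false positive) = {familywise_error(k):.2f}, "
          f"Bonferroni per-test alpha = {bonferroni_alpha(k):.4f}")
```

Under these assumptions the chance of at least one spurious "significant" subgroup grows quickly with k, while the Bonferroni threshold shrinks, which costs power in subgroups that are already underpowered.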
A Priori and A Posteriori Subgroups
• A priori: written in the protocol in advance of the study (hypothesis driven)
• A posteriori (post hoc or exploratory):
  - specified … later
  - before unblinding
  - after unblinding
• Both have inflated error rates, but the problem is greater for a posteriori defined subgroups.

INSIGHT START Protocol: Early Treatment for HIV
“Subgroup analyses for the primary endpoint and major secondary outcomes will be performed to determine whether the treatment effect (early versus deferred) differs qualitatively across various baseline-defined subgroups. Subgroup analysis will be performed by age, gender, race/ethnicity, geographic region, the presence of risk factors for serious non-AIDS conditions, baseline CD4+ cell count, baseline HIV RNA level, calendar date of enrollment to assess the effect of different treatment patterns that may emerge, and the ART-regimen pre-specified at the time of randomization…. An overall test of heterogeneity will provide evidence of whether the magnitude of the treatment difference varies across baseline subgroups.”

Pre- and Post-Stratification and Subgroup Analysis
• Pre-stratification variables are often, but not always, subgroups of interest.
• Aim of post-stratified analysis is to obtain a “better” estimate of overall treatment effect.
• Aim of subgroup analysis is to determine whether treatment differences are consistent.
• Like post-stratification, plans for subgroup analysis should be pre-specified, although sometimes there are surprises.

Subgrouping vs. Stratification

  Grouping              Purpose
  Pre-stratification    “insurance” for balance in randomization
  Post-stratification   increase the accuracy of estimates of treatment effect
  Subgroups             check the consistency of the treatment effect

Stratified Design for Comparing Treatments

                Treatment
  Stratum     A       B       Total
  1           m1A     m1B     m1
  2           m2A     m2B     m2
  3           m3A     m3B     m3
  4           m4A     m4B     m4
  Total       nA      nB

• Typical situation: m1 ≠ m2 ≠ m3 ≠ m4
• Study is designed/powered based on nA and nB
• Goal: miA = miB for all i.
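To show how the stratified layout above feeds both a post-stratified overall estimate and a subgroup check, here is a minimal sketch. The stratum sizes and arm means are hypothetical, invented for illustration and not taken from any trial, and the weighting by stratum size m_i / N is one simple choice among several.

```python
# Hypothetical strata: total size m and mean outcome in arms A and B.
strata = [
    {"stratum": 1, "m": 100, "mean_a": 5.0, "mean_b": 4.0},
    {"stratum": 2, "m": 200, "mean_a": 4.5, "mean_b": 4.0},
    {"stratum": 3, "m": 300, "mean_a": 4.0, "mean_b": 3.5},
    {"stratum": 4, "m": 400, "mean_a": 3.0, "mean_b": 3.0},
]


def post_stratified_difference(rows):
    """Overall A-minus-B difference, weighting each stratum by its share of patients."""
    total = sum(r["m"] for r in rows)
    return sum(r["m"] / total * (r["mean_a"] - r["mean_b"]) for r in rows)


# Subgroup view: are the per-stratum differences consistent?
for r in strata:
    print(f"stratum {r['stratum']}: A - B = {r['mean_a'] - r['mean_b']:.2f}")

# Post-stratified view: one overall estimate.
print(f"post-stratified A - B = {post_stratified_difference(strata):.2f}")
```

The same per-stratum numbers serve two purposes: combined, they give the overall estimate; compared with each other, they address consistency, which is the subgroup question.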
Subgrouping Factors Determined Experimentally
[Figure: two 2 x 2 layouts of A / No A by B / No B. In the 2 x 2 factorial, B / No B is determined by randomization; in the comparison layout, B / No B is a baseline characteristic.]

NIH Policy on Subgroups
“When an NIH-defined Phase III clinical trial is proposed, evidence must be reviewed to show whether or not clinically important sex/gender and race/ethnicity differences in the intervention effect are to be expected.”
“Inclusion of the results of sex/gender, race/ethnicity and relevant subpopulations analyses is strongly encouraged in all publication submissions.”
http://grants.nih.gov/grants/funding/women_min/guidelines_amended_10_2001.htm

ICH Guidelines on Subgroups
• If the size of the study permits, important demographic or baseline value-defined subgroups should be examined.
• These analyses are not intended to “salvage” an otherwise unsupportive study.
• Subgroup analyses may suggest hypotheses to be examined in other studies.
• If there is a prior hypothesis about a subgroup, this should be part of the statistical analysis plan.

Issues to Consider
• Appropriate significance level? Bonferroni method may be too conservative – loss of power in a situation where power is already low.
• Should subgroup analysis be performed if the overall result is negative? Much harder sell.
• Should only a priori subgroups be described? We are not always that smart.
• How should subgroup analyses be presented? Interaction tests are important.
• Should analyses be based on post-randomization measures? No.

A Consumer’s (and Producer’s?) Guide to Subgroup Analysis
• Document heterogeneity between subgroups
• Argue consistency with biologic phenomena
• Argue consistency with other data from the trial
• Argue consistency with other studies
• It is easy to build a story after the fact!

Data from Neonatal Hypocalcemia Trial: All Calcium Levels in mmol/l

                            Treatment   No.                Treatment
                            mean        babies   SE        effect      SE       P-value
  Breast-fed   Supplement   2.445        64      0.0365
               Placebo      2.408       102      0.0311    0.037       0.0480   0.44
  Bottle-fed   Supplement   2.300       169      0.0211
               Placebo      2.195       285      0.0189    0.105       0.0283   0.0002

Reference: Cockburn et al, BMJ, 281:11-14; 1980. See also Pocock, Clinical Trials: A Practical Approach.

Data from Neonatal Hypocalcemia Trial (cont.)
Test for interaction (difference between the two treatment effects divided by its standard error):
\[
Z = \frac{0.037 - 0.105}{\left(0.0365^2 + 0.0311^2 + 0.0211^2 + 0.0189^2\right)^{1/2}} = \frac{-0.068}{0.0557} = -1.22
\]
P-value = 0.22

HDFP Study

                      Deaths         Percent Difference
  Race, Sex, Age      SC     RC      in Mortality
  Black men           112    140     -18.5
  Black women          70     98     -27.8
  White men           109    126     -14.7
  White women          58     55      +2.1
  30-49                81     82      -5.7
  50-59               115    159     -25.3
  60-69               153    178     -16.4
  Overall             349    419     -16.9

HDFP Subgroups

  Black Men (1)         Dead    Alive
    SC                  112      952
    RC                  140      944
  \(\hat{O}_1 = 0.79,\ w_1 = 55.0\)

  Black Women (2)       Dead    Alive
    SC                   70     1274
    RC                   98     1256
  \(\hat{O}_2 = 0.70,\ w_2 = 38.3\)

  White Men (3)         Dead    Alive
    SC                  109     1783
    RC                  126     1735
  \(\hat{O}_3 = 0.84,\ w_3 = 54.8\)

  White Women (4)       Dead    Alive
    SC                   58     1026
    RC                   55     1101
  \(\hat{O}_4 = 1.13,\ w_4 = 26.8\)

Here \(\hat{O}_i\) is the odds ratio for death (SC versus RC) in subgroup i and \(w_i\) is the inverse of the estimated variance of \(\log \hat{O}_i\).
\[
\sum_{i=1}^{4} w_i = 174.9
\]
\[
\log \hat{O}_p = \left[(55.0)\log(0.79) + (38.3)\log(0.70) + (54.8)\log(0.84) + (26.8)\log(1.13)\right] / 174.9
\]
\[
\log \hat{O}_p = -0.188, \qquad \hat{O}_p = 0.83
\]
\[
X^2_{(3)} = \sum_{i=1}^{4} w_i \left(\log \hat{O}_i - \log \hat{O}_p\right)^2 = 0.134 + 1.111 + 0.008 + 2.551 = 3.804; \qquad p = 0.28
\]

Aspirin and Risk of Stroke or Death

                   Men                   Women
                   Event   No Event      Event   No Event
  Aspirin           29      171           17      73
  No Aspirin        56      150           12      77

Among men, aspirin reduced the risk of stroke or death by 48% (p=0.004); among women, aspirin increased the risk of stroke or death by 42% (p=0.35). Overall, aspirin reduced the risk of stroke or death by 31% (p=0.05).
Conclusion: “We conclude that aspirin is an efficacious drug for men with threatened stroke”. N Engl J Med 1978; 299:53-59.

[Figure: Absolute effects of antiplatelet therapy on vascular events in the 29 trials in high risk patients with separate information available on each patient, subdivided by age and sex and by diastolic blood pressure and diabetes.]
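The two interaction calculations above (neonatal hypocalcemia and HDFP) can be reproduced with a short script. This is a sketch under stated assumptions: it treats the subgroups as independent, uses a normal approximation for Z, and takes the HDFP weights to be inverse variances of the log odds ratios (Woolf's method), which matches the printed values. All function names are illustrative. Because it works from the raw counts rather than the rounded odds ratios, the HDFP statistic comes out slightly below the hand calculation (about 3.7 rather than 3.80), with the same conclusion.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


def interaction_z(effect_1, se_1, effect_2, se_2):
    """Z statistic for the difference between two independent subgroup effects."""
    return (effect_1 - effect_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)


def woolf_homogeneity(tables):
    """Pooled odds ratio and homogeneity chi-square for a list of 2x2 tables.

    Each table is (a, b, c, d) = (SC dead, SC alive, RC dead, RC alive).
    """
    log_ors = [math.log(a * d / (b * c)) for a, b, c, d in tables]
    weights = [1.0 / (1 / a + 1 / b + 1 / c + 1 / d) for a, b, c, d in tables]
    pooled = sum(w * l for w, l in zip(weights, log_ors)) / sum(weights)
    x2 = sum(w * (l - pooled) ** 2 for w, l in zip(weights, log_ors))
    return math.exp(pooled), x2


# Neonatal hypocalcemia: SE of each treatment effect from the two arm SEs.
se_breast = math.sqrt(0.0365 ** 2 + 0.0311 ** 2)
se_bottle = math.sqrt(0.0211 ** 2 + 0.0189 ** 2)
z_neonatal = interaction_z(0.037, se_breast, 0.105, se_bottle)
print(f"neonatal interaction: Z = {z_neonatal:.2f}, p = {two_sided_p(z_neonatal):.2f}")

# HDFP: black men, black women, white men, white women.
hdfp = [(112, 952, 140, 944), (70, 1274, 98, 1256),
        (109, 1783, 126, 1735), (58, 1026, 55, 1101)]
or_pooled, x2_hdfp = woolf_homogeneity(hdfp)
print(f"HDFP: pooled OR = {or_pooled:.2f}, X2(3) = {x2_hdfp:.2f}, "
      f"p = {chi2_sf_3df(x2_hdfp):.2f}")
```

In both examples one subgroup looks "significant" and another does not, yet the direct test of whether the effects differ is far from significant.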
Figure annotation: no difference by gender.
Antiplatelet Trialists' Collaboration, BMJ 1994;308:81-106

Cox Model for Interaction
• Treatment x gender interaction
  \(Z_1\) = 1 if eplerenone; 0 if placebo
  \(Z_2\) = 1 if male; 0 if female
  \(Z_3 = Z_1 \times Z_2\)
\[
h(t; Z) = h_0(t)\exp\left[\beta_1 Z_1 + \beta_2 Z_2 + \beta_3 Z_3\right]
\]
\[
H_0: \beta_3 = 0
\]
Under this model the hazard ratio for eplerenone versus placebo is \(\exp(\beta_1)\) in women and \(\exp(\beta_1 + \beta_3)\) in men, so \(H_0\) states that the treatment effect is the same in both sexes.

Subgroup Analyses According to Follow-up Time
• Heart and Estrogen/progestin Replacement Study (HERS)
  – JAMA 1998; 280: 605-613.
• Adenomatous Polyp Prevention on Vioxx (APPROVe) Trial
  – N Engl J Med 2005; 352:1092-1102
  – Lancet 2008; 372:1756-1764.

HERS

                        Estrogen-Progestin   Placebo     Hazard Ratio
                        (n=1380)             (n=1383)    (95% CI)
  Primary CHD events    172                  176         0.99 (0.80 – 1.22)
    Year 1               57                   38         1.52
    Year 2               47                   48         1.00
    Year 3               35                   41         0.87
    Year 4               33                   49         0.67
  P=.009 for interaction

APPROVe

                                Rofecoxib    Placebo     Hazard ratio
                                (n=1287)     (n=1299)    (95% CI)
  Confirmed thrombotic events   46           26          1.92 (1.19 – 3.11)
    Months 0-18                 22           20          1.18
    Months 19-36                24            6          4.45
  P=.01 for failed test of proportional hazards (interaction)

It was later determined that a different test for interaction had been pre-specified and that inclusion of events after treatment discontinuation changed the findings.

Barrett-Connor on HERS*
A Fable: Looking for the Pony
A man has 2 sons, one a hopeless pessimist and the other an unrealistic optimist. Determined to change their thinking to a less extreme position, the man buys a room full of toys for the pessimist and a room full of horse manure for the optimist. When he returns, the pessimist is crying because he has broken all of his toys. In contrast, the optimist is shoveling through his gift and proclaims: “With all that manure there must be a pony in there somewhere.”
* Circulation 2002;105:902-903.

“New Study Reassures Most Users of Hormones.
For Newly Menopausal, There’s No Heart Risk; A Reversal of Findings.”
“At Issue is something called the P value…”
Wall Street Journal, April 4, 2007

Cardiovascular and Global Index Events by Years Since Menopause at Baseline (WHI Study)

                     Years since   No. of cases
  Outcome            menopause     Hormone Therapy   Placebo   HR (95% CI)         P value for Trend†
  CHD‡               <10            39                51       0.76 (0.50-1.16)    .02
                     10-19         113               103       1.10 (0.84-1.45)
                     ≥20           194               158       1.28 (1.03-1.58)
  Stroke             <10            41                23       1.77 (1.05-2.98)    .36
                     10-19         100                79       1.23 (0.92-1.66)
                     ≥20           142               113       1.26 (0.98-1.62)
  Total Mortality    <10            53                67       0.76 (0.53-1.09)    .51
                     10-19         142               149       0.98 (0.78-1.24)
                     ≥20           267               240       1.14 (0.96-1.36)
  Global Index§      <10           222               203       1.05 (0.86-1.27)    .62
                     10-19         482               440       1.12 (0.98-1.27)
                     ≥20           675               632       1.09 (0.98-1.22)

Arm sizes as printed in the column headers (hormone therapy, placebo): n=3608, n=3529; n=4483, n=3529; n=3608, n=3529.
† Test for trend (interaction) using years since menopause as continuous (linear) form of categorical coded values. Cox regression models stratified according to active vs. placebo and trial, including terms for years since menopause and the interaction between trials and years since menopause.
JAMA 2007;297:1465-1477

CHD Events by Years Since Menopause at Baseline

  Years Since Menopause   <10                 10-19               ≥20                 P value for Trend†
  CHD‡ HR (95% CI)        0.76 (0.50-1.16)    1.10 (0.84-1.45)    1.28 (1.03-1.58)    .02

“These analyses, although not definitive, suggest that the health consequences of hormone therapy may vary by distance from menopause…”

AIDS Vaccine Trial (Science 28 February 2003)

              Infected   Not Infected   Total
  Vaccine     191        3,139          3,330
  Placebo      98        1,581          1,679
  Total       289        4,720          5,009

5.7% vs. 5.8%; \(\widehat{OR} = 0.98\), 95% CI (0.78 to 1.24)

AIDS Vaccine Trial Subgroup Analysis

  White and Hispanic      Infected   Not Infected
    Vaccine               179        2,824
    Placebo                81        1,427
  6.0 vs. 5.4%; \(\widehat{OR}_1 = 1.12\) (95% CI: 0.85 to 1.46)

  Black, Asian, Other     Infected   Not Infected
    Vaccine                12          315
    Placebo                17          154
  3.7 vs. 9.9%; \(\widehat{OR}_2 = 0.35\) (95% CI: 0.16 to 0.74)

\(\hat{O}_p = 1.02\); \(\chi^2_1 = 8.6\) for homogeneity of odds ratio; p = 0.003

Example: ACTG 155

  Arms          Randomization (allocation ratio)
  AZT           2
  ddC           2
  AZT + ddC     3

Primary outcome: disease progression (AIDS/death)
Secondary outcome: CD4+ cell count change, toxicities
Sample size: 991

  Subgrouping   CD4<50   50≤CD4<150   CD4≥150
  Number        269      336          386

“We found no overall benefits of zalcitabine used alone or with zidovudine. However, a trend analysis suggested a better outcome for combination therapy compared with zidovudine as the pretreatment CD4 cell count increased”.
“Our study suggests that combination therapy may be beneficial in patients with higher CD4 cell counts”.

Pooled Analysis of AZT + ddX vs. AZT, Treatment Naïve Patients

  Baseline CD4+   No. AIDS/Death Events   Hazard Ratio*
  < 100           382                     0.66 (0.53 - 0.82)
  100 - 199       319                     0.63 (0.50 - 0.81)
  200 - 299       186                     0.62 (0.45 - 0.84)
  300 - 499        90                     0.63 (0.40 - 0.98)
  *AZT + ddX vs. AZT

Some Lessons From ACTG 155 Presentation
1. What does “a priori” mean? If it is important, amend the protocol.
2. Confusion about stratification and subgrouping.
3. It is easy to develop explanations for possible subgroup effects.
4. By chance some subgroups will be more extreme than others.
5. For an ordered/continuous variable, a test for trend is important. With CD4+ categories < 50, 50 - 149, and 150+: 4 df test for interaction (3 treatment groups and 3 CD4 categories) or 2 df test (3 treatment groups and continuous CD4).
6. “Subgroup label” may be a marker for something else.

Guidelines to Follow for Interpreting Subgroup Analysis
• Assess magnitude of interaction before focusing on separate subgroups and their tests of significance
• Assess consistency with biologic phenomena, realizing that “human imagination is capable of developing a rationale for most findings” (Ware, NEJM, 2003).
• Assess consistency with other data from trial
• Assess consistency with other studies

Guidelines For Reporting Subgroup Analyses (NEJM 2007;357:2189-2194)
• Abstract: Only if based on primary outcome and pre-specified
• Methods: Number pre-specified; any of special interest; endpoint; methods used to assess heterogeneity; number performed; potential effect on type 1 error
• Results: Present tests of heterogeneity; forest plot
• Discussion: Cautious in interpretation; state limitations; cite supporting or contradictory data

Criteria Used to Assess Credibility of Subgroup Effect (BMJ 2012;344:e1553)
• Design
  – Baseline characteristic?
  – Stratification factor?
  – A priori specified?
  – Fewer than 5 subgroups tested?
• Analysis
  – Test for interaction performed?
  – If multiple interactions, independent?
• Context
  – Direction correctly pre-specified?
  – Consistent with evidence from previous studies?
  – Consistent across outcomes?
  – Indirect evidence (e.g., biologic rationale) supports finding?

Methods Section of ESPRIT Paper
N Engl J Med 2009; 361: p. 1550
“Data on the primary end point were summarized for pre-specified subgroups defined according to baseline characteristics. A total of 12 subgroup analyses were prespecified. The heterogeneity of hazard-ratio estimates between subgroups were assessed by including an interaction term between treatment and subgroup in expanded Cox models. The results of subgroup analyses should be interpreted with caution; a significant interaction could be due to chance, because there was no adjustment made to type 1 error for the number of subgroups examined.”
Five subgroups were reported: age, gender, race/ethnicity, baseline CD4+ count and baseline HIV RNA level.

Summary
• P-values for individual subgroups are misleading – report CIs.
• Calculate subgroup by treatment interactions, but be cognizant of low power and risks of type 1 error if multiple subgroups are examined.
• Keep in mind most trials are designed assuming no interaction.
• Define key subgroups to be investigated in the protocol.
• Report subgroup findings very cautiously – ultimately want validation in another study or meta-analysis.

“Only one thing is worse than doing subgroup analyses --- believing the results.” Richard Peto
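The Summary's advice (report CIs per subgroup and test the interaction directly) can be sketched with the AIDS vaccine trial counts given earlier. Assumptions: Woolf log-scale confidence intervals with a fixed 1.96 multiplier, and a Z test on the difference of the two log odds ratios; function names are illustrative. The subgroup CIs agree with those on the slide. The interaction statistic (Z squared, about 8.1) is somewhat smaller than the slide's 8.6, presumably because a different homogeneity test was used there, but the conclusion is the same.

```python
import math


def log_odds_ratio(a, b, c, d):
    """Log odds ratio and its Woolf standard error for a 2x2 table.

    a, b = vaccine infected / not infected; c, d = placebo infected / not infected.
    """
    return math.log(a * d / (b * c)), math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95% confidence interval."""
    log_or, se = log_odds_ratio(a, b, c, d)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


white_hispanic = (179, 2824, 81, 1427)
black_asian_other = (12, 315, 17, 154)

or_1, lo_1, hi_1 = odds_ratio_ci(*white_hispanic)
or_2, lo_2, hi_2 = odds_ratio_ci(*black_asian_other)
print(f"White and Hispanic:  OR = {or_1:.2f} ({lo_1:.2f} to {hi_1:.2f})")
print(f"Black, Asian, Other: OR = {or_2:.2f} ({lo_2:.2f} to {hi_2:.2f})")

# Interaction: difference of log odds ratios relative to its standard error.
l1, se1 = log_odds_ratio(*white_hispanic)
l2, se2 = log_odds_ratio(*black_asian_other)
z_int = (l1 - l2) / math.sqrt(se1 ** 2 + se2 ** 2)
p_int = math.erfc(abs(z_int) / math.sqrt(2.0))
print(f"interaction: Z^2 = {z_int ** 2:.1f}, p = {p_int:.4f}")
```

A small p-value here still calls for the caution urged throughout: the subgroup was one of several examined, the smaller subgroup has few events, and validation in another study is needed.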