Multiplicity and subset analysis in clinical research: Some practical advice

Abstract

In spite of the sound advice available, from a multitude of reliable and rigorous articles, for limiting or controlling Type I error in clinical research, many studies are conducted without following that advice. This article describes compromises between strict adherence to pre-determined protocols and unlimited data dredging. The article does not propose any specific testing methods, but all methods discussed control some overall error rate at a specified level. We propose a continuum of three types of hypothesis tests, with a particular error rate allocation in each. If a 5% Type I error is desired, then we suggest that researchers dedicate a 5% significance level to testing one or a few primary outcomes, and an additional level, possibly 5%, to testing an entire set of pre-designated secondary outcomes. At the third level, any additional outcomes that become important or relevant must be considered unsupported by statistical methods, even if their p-values or estimated effect sizes appear to be unusually compelling. Any theory or empirical evidence supporting these additional results should be presented, but the results remain speculative unless supported by additional targeted research. We give examples of methods and results that fit into this continuum. (187)

subgroups.draft.24july06

Introduction

There are many published articles giving advice on how (or whether) to carry out subset analyses, whether planned or unplanned, in clinical research without incurring a large number of Type I errors. In spite of the sound advice given in these articles, many research studies are still conducted without following it. Recognizing that researchers will explore their data and are anxious to glean as much as possible from the results, this article describes some compromises between strict adherence to pre-determined protocols and unlimited data dredging.
We suggest that researchers dedicate a 5% significance level to testing a small, targeted group of primary outcomes (perhaps only one), in which subsets usually will not play a part, and an additional level, possibly 5%, to testing an entire set of pre-designated secondary outcomes. Beyond these, any additional outcomes that draw attention, whether because of very small p-values or large estimated effect sizes, should be reported as speculation, unsupported by good statistical methods, even if those p-values or estimated effect sizes appear to be unusually compelling. Any theory or empirical evidence supporting these additional results should be presented, but the results remain speculative unless supported by additional targeted research. The article does not propose any specific testing methods; the only requirement is that methods control some overall error rate at a specified level. (How should Bayesian methods be described here?)

Within both the targeted and the pre-designated groups, the error probability should be controlled at the designated level. This may or may not be describable in terms of error allocation, and the analysis may even be considered exploratory, for example, looking at all pairwise differences among a set of subgroups. As long as the method chosen controls the chosen overall error rate, such an analysis belongs in the second, pre-designated set (for example, with all pairwise comparisons, a method based on the studentized range might be chosen if the necessary assumptions were supported). If secondary analysis of a data set is carried out, previous findings must be considered. Reanalysis of the main findings should be based on the primary and secondary endpoints originally proposed. New formulations have been informed by the original findings, and should be treated as if they are in the third, unplanned category unless compelling reasons are given to treat them otherwise.
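As one concrete illustration of controlling the error probability within a pre-designated family without an explicit per-outcome allocation, the sketch below (in Python, with hypothetical p-values; the choice of the Holm procedure is ours for illustration, not a recommendation of this article) applies Holm's step-down method, which controls the familywise error rate at the designated level:

```python
def holm_reject(pvals, alpha=0.05):
    """Holm step-down procedure: familywise error control within one
    pre-designated family of outcomes, at overall level alpha.

    Returns a list of booleans, one per input p-value, indicating
    which hypotheses are rejected."""
    m = len(pvals)
    # Rank hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared to alpha / (m - k + 1).
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: once one hypothesis is retained, stop
    return reject

# Hypothetical p-values for three pre-designated secondary outcomes:
print(holm_reject([0.001, 0.04, 0.03]))  # only the first is rejected
```

Note that Holm's method is uniformly at least as powerful as a plain Bonferroni split of alpha across the family, which is one reason "allocation" is not always the right way to describe familywise control.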
The nominal alpha level at which secondary outcomes are tested should be at most the original level. One advantage of using the original level is that if some secondary outcomes are very highly correlated with the original ones, there is little loss of power in using that level, assuming appropriate consideration is given to the correlations within the secondary class. Subgroups may be based on prespecified categories of special interest (e.g. sex, age, race), or on examination of outcomes (see 7d below).

Examples: We need examples for most points. Especially important are examples in this third group where there are very extreme p-values but subsequent research has not supported the result. On effect sizes: these are often not reported. If they are, we can include large apparent effect sizes that have not been replicated. Since these analyses are often based on very small sample sizes, effect sizes may be even more unreliable than p-values.

Specific issues: I left these since I think they might be reasonable additional points. I would add to them: use raw data rather than aggregated data as much as possible; binning data (putting them into intervals) can change apparent relationships (Wainer reference). Also, I mentioned Lord's paradox below. In view of the Wainer-Brown paper, we might also mention Simpson's and Kelley's paradoxes. These do not affect methodology so much as they cause problems of interpretation and explanation.

1. No real difference between allowing for pre-specified outcomes and allowing for discovered outcomes: multiplicity issues affect both. The difference lies in the difficulty of specifying the family of inferences in the latter case. One may notice something that had not been anticipated in any way. It might be possible to estimate a family size retrospectively.
For example, in a multifactor analysis of variance design, a two-way interaction may unexpectedly turn out to be unusually large; this might be evaluated using the family of all main effects and two-way interactions. However, such retrospective consideration is highly suspect, since it is unclear what other effects might have been considered.

2. Pre-specify a small number (perhaps only one) of important outcomes and allow a maximum Type I error (usually set at .05) among these.

3. Allow some maximum Type I error (possibly the same value as in 2) for all other analyses that have been planned for. Specification of family size and analytic methods is crucial here. Control other than familywise error control (e.g. FDR) may be considered. Consider Bayesian analysis if appropriate. This family may include some analyses often called exploratory, such as all pairwise comparisons (Tukey methods) or all contrasts (Scheffe methods) among subgroups.

4. Consider hierarchical analyses--not going to a further step unless a test at an earlier step is significant--to gain power. In general, it is not always appropriate to speak of "allocating" the error within the primary and secondary groups, since even with familywise error control, overall control is often achieved without dividing the total allowable error into subsets (e.g. stepwise methods cannot be viewed as allocation).

5. Results from 3 and 4 might be labelled promising but unreliable.

6. If any further exploration is carried out, label any findings purely exploratory and highly speculative, with no good statistical evidence to back them up, because the set of outcomes considered is usually impossible to specify. (My feeling is that if people are going to do this anyway, it would be better if at least they label their findings in this way.)

7. Some more specific points:

(a) In interaction analyses, consider testing for qualitative rather than only quantitative interaction.
It is much less important to note differences among subgroup outcomes if they are all either positive or negative than to discover that outcomes differ in direction among subgroups.

(b) Use nonparametric methods (permutation, bootstrap) where possible to avoid parametric assumptions unless there is support for the latter.

(c) Covariance adjustment with unreliable or randomly-fluctuating covariates may be misleading; see Lord's paradox and the related Kelley's paradox (Wainer reference).

(d) In analyses under (3) or (6), consider starting with outcome groups and finding the best predictors via regression analysis or some type of splitting method (e.g. CART), with attention to error control and avoidance of overfitting.

(e) In an analysis of variance design, steps 3 or 4 might be accomplished using Tukey- or Scheffe-type analyses, as noted, or an intermediate approach allowing all unweighted combinations of averages of one set of subgroup means minus averages of another set (Klockars and Hancock, 1998).

(f) Consider testing for differences in distributions of subgroups other than mean differences. Choice of tests and interpretation of results depend on the purpose of testing and the expected differences among subgroups.

Finally, we should realize that even with the most carefully designed studies, we must expect that some results will not be replicated. Some results are inevitably going to be false positives, and the proportion of these depends on the proportion of studies in which there is really no effect. In the worst case, suppose virtually all the treatments being evaluated in studies are ineffective. Then virtually all the positive findings will be Type I errors, and in independent replications of these studies, only a proportion alpha will be expected to show significant results.
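This worst-case arithmetic can be checked with a small simulation (a sketch under the stated assumption that every treatment is truly ineffective; the number of studies and the seed are arbitrary choices of ours):

```python
import random

def simulate_null_replication(n_studies=20000, alpha=0.05, seed=1):
    """Worst case: every treatment is truly ineffective, so every
    'significant' finding is a Type I error.  Estimate the rate of
    positive findings, and how often an independent replication of a
    positive study is itself positive."""
    rng = random.Random(seed)
    positives = 0
    replicated = 0
    for _ in range(n_studies):
        # Under the null, a study is 'significant' with probability alpha.
        if rng.random() < alpha:
            positives += 1
            # The replication is also a true null, so it too is
            # 'significant' only with probability alpha.
            if rng.random() < alpha:
                replicated += 1
    return positives / n_studies, replicated / max(positives, 1)

pos_rate, rep_rate = simulate_null_replication()
# Both rates hover near alpha = 0.05: virtually all positives are
# Type I errors, and only about a proportion alpha replicate.
```

The simulation simply restates the argument in the text: when there are no real effects, the replication rate of positive findings is alpha, not some reassuringly large number.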
Things are surely not that bad, and designed studies are usually carried out only when there is some independent reason to think that they are dealing with real effects, but the sobering fact is that we cannot expect the results of any one study to be definitive, and there is an unknown limit to the number of studies we can expect to be supported in replications.

REFERENCES

PUBMED search, Monday 17 July 2006; "subgroup analysis", Field: Title/Abstract; Limits: English, Review, Male, Female, Humans, Core clinical journals.

1: Pocock SJ, Assmann SF, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Stat Med. 2002 Oct 15;21(19):2917-30.

2: Assmann SF, Pocock SJ, Enos LE, Kasten LE. Subgroup analysis and other (mis)uses of baseline data in clinical trials. Lancet. 2000 Mar 25;355(9209):1064-9.

3: Higashida RT, Furlan AJ, Roberts H, Tomsick T, Connors B, Barr J, Dillon W, Warach S, Broderick J, Tilley B, Sacks D; Technology Assessment Committee of the American Society of Interventional and Therapeutic Neuroradiology; Technology Assessment Committee of the Society of Interventional Radiology. Trial design and reporting standards for intra-arterial cerebral thrombolysis for acute ischemic stroke. Stroke. 2003 Aug;34(8):e109-37. Epub 2003 Jul 17. Erratum in: Stroke. 2003 Nov;34(11):2774.

4: Pocock SJ, Hughes MD, Lee RJ. Statistical problems in the reporting of clinical trials. A survey of three medical journals. N Engl J Med. 1987 Aug 13;317(7):426-32.

5: Moreira ED Jr, Stein Z, Susser E. Reporting on methods of subgroup analysis in clinical trials: a survey of four scientific journals. Braz J Med Biol Res. 2001 Nov;34(11):1441-6.

6: Schulz KF, Chalmers I, Grimes DA, Altman DG. Assessing the quality of randomization from reports of controlled trials published in obstetrics and gynecology journals. JAMA. 1994 Jul 13;272(2):125-8.
7: Hernandez AV, Steyerberg EW, Habbema JD. Covariate adjustment in randomized controlled trials with dichotomous outcomes increases statistical power and reduces sample size requirements. J Clin Epidemiol. 2004 May;57(5):454-60.

8: Canner PL. Covariate adjustment of treatment effects in clinical trials. Control Clin Trials. 1991 Jun;12(3):359-66.

9: Grant AM, Altman DG, Babiker AB, Campbell MK, Clemens FJ, Darbyshire JH, Elbourne DR, McLeer SK, Parmar MK, Pocock SJ, Spiegelhalter DJ, Sydes MR, Walker AE, Wallace SA; DAMOCLES study group. Issues in data monitoring and interim analysis of trials. Health Technol Assess. 2005 Mar;9(7):1-238, iii-iv. Review.

10: Bhandari M, Devereaux PJ, Li P, Mah D, Lim K, Schunemann HJ, Tornetta P 3rd. Misuse of baseline comparison tests and subgroup analyses in surgical trials. Clin Orthop Relat Res. 2006 Jun;447:247-51.

11: Tangen CM, Koch GG. Non-parametric analysis of covariance for confirmatory randomized clinical trials to evaluate dose-response relationships. Stat Med. 2001 Sep 15-30;20(17-18):2585-607.

12: Yusuf S, Wittes J, Probstfield J, Tyroler HA. Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. JAMA. 1991 Jul 3;266(1):93-8.

13: Wang SJ, Hung HM. Adaptive covariate adjustment in clinical trials. J Biopharm Stat. 2005;15(4):605-11.

14: Rothwell PM. Treating individuals 2. Subgroup analysis in randomised controlled trials: importance, indications, and interpretation. Lancet. 2005 Jan 8-14;365(9454):176-86.

15: Grouin JM, Coste M, Lewis J. Subgroup analyses in randomized clinical trials: statistical and regulatory issues. J Biopharm Stat. 2005;15(5):869-82.

16: Lesaffre E, Bogaerts K, Li X, Bluhmki E. On the variability of covariate adjustment: experience with Koch's method for evaluating the absolute difference in proportions in randomized clinical trials. Control Clin Trials. 2002 Apr;23(2):127-42.

17: DeMets DL.
Statistical issues in interpreting clinical trials. J Intern Med. 2004 May;255(5):529-37. Review.

18: Laopaiboon M. Meta-analyses involving cluster randomization trials: a review of published literature in health care. Stat Methods Med Res. 2003 Dec;12(6):515-30. Review.

19: Grundy SM, Cleeman JI, Merz CN, Brewer HB Jr, Clark LT, Hunninghake DB, Pasternak RC, Smith SC Jr, Stone NJ; Coordinating Committee of the National Cholesterol Education Program. Implications of recent clinical trials for the National Cholesterol Education Program Adult Treatment Panel III Guidelines. J Am Coll Cardiol. 2004 Aug 4;44(3):720-32. Review.

20: Brookes ST, Whitely E, Egger M, Smith GD, Mulheran PA, Peters TJ. Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test. J Clin Epidemiol. 2004 Mar;57(3):229-36.

Klockars AJ, Hancock GR. (1998). A more powerful post hoc comparison procedure in analysis of variance. J. of Educational and Behavioral Statistics, 23, 279-289.