Multiplicity and subset analysis in clinical research: Some practical advice
Abstract
Despite the sound advice available in a multitude of reliable and rigorous
articles on limiting or controlling Type I error in clinical research, many
studies are conducted without following that advice. This article describes
compromises between strict adherence to pre-determined protocols and
unlimited data dredging. The article does not propose any specific testing
methods, but all methods discussed control some overall error rate at a
specified level. We propose a continuum of three types of hypothesis tests,
with a particular error rate allocation in each. If a 5% Type I error rate is
desired, we suggest that researchers dedicate a 5% significance level to
testing one or a few primary outcomes, and then an additional level, possibly
5%, to testing an entire set of pre-designated secondary outcomes. At the
third level, any additional outcomes that become important or relevant must be
considered
unsupported by statistical methods even if those p-values or estimated
effect sizes appear to be unusually compelling. Any theory or empirical
evidence supporting these additional results should be presented, but the
results remain speculative unless supported by additional targeted research.
We give examples of methods and results that fit into this continuum.
subgroups.draft.24july06 p. 2
Introduction
There are many published articles giving advice on how (or whether) to
carry out subset analyses, whether planned or unplanned, in clinical
research without incurring a large number of Type I errors. In spite of the
sound advice given in these articles, many research studies are still
conducted without following that advice. Recognizing that researchers will
explore their data and are eager to glean as much as possible from the
results, this article describes some compromises between strict adherence to
pre-determined protocols and unlimited data dredging.
We suggest that researchers dedicate a 5% significance level for testing a
small, targeted group of primary outcomes (perhaps only one), in which
subsets usually won't play a part, and an additional level, possibly 5%, for
testing an entire set of pre-designated secondary outcomes. Beyond these,
any additional outcomes that draw attention, either because of very small
p-values or large estimated effect sizes, should be reported as speculation,
unsupported by good statistical methods even if those p-values or estimated
effect sizes appear to be unusually compelling. Any theory or empirical
evidence supporting these additional results should be presented, but the
results remain speculative unless supported by additional
targeted research.
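As a rough illustration of this two-tier allocation (a sketch only: the
p-values and family sizes are invented, Bonferroni is used within the
secondary family for simplicity, and any method controlling the familywise
error rate would satisfy the requirement):

```python
# Sketch of the two-tier error allocation described above (hypothetical
# numbers; Bonferroni within each family is just one acceptable choice).

def bonferroni_threshold(alpha, family_size):
    """Per-test significance threshold controlling familywise error at alpha."""
    return alpha / family_size

def significant(p_value, alpha, family_size):
    """True if the test is significant after Bonferroni adjustment."""
    return p_value <= bonferroni_threshold(alpha, family_size)

# One primary outcome tested at the full 5% level.
primary_significant = significant(0.03, alpha=0.05, family_size=1)

# Four pre-designated secondary outcomes share a separate 5% budget.
secondary_ps = [0.002, 0.030, 0.011, 0.200]
secondary_significant = [significant(p, alpha=0.05, family_size=4)
                         for p in secondary_ps]
```

With four secondary outcomes, each is judged against 0.05/4 = 0.0125, so the
second p-value (0.030) no longer counts as significant even though it is
below 0.05.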
The article does not propose any specific testing methods; the only
requirement is that methods control some overall error rate at a specified
level. (How should Bayesian methods be described here?) Within both the
targeted and the additional groups, the error probability should be
controlled at the designated
level. This may or may not be describable in terms of error allocation, and
may even be considered exploratory, for example, looking at all pairwise
differences among a set of subgroups. As long as the method chosen
controls the chosen overall error rate, this would be described as being in
the second, predesignated set (for example, with all pairwise comparisons, a
method based on the studentized range might be chosen if the necessary
assumptions were supported).
If secondary analysis of a data set is carried out, previous findings must be
considered. Reanalysis of the main findings should be based on the primary
and secondary endpoints originally proposed. New formulations have been
informed by the original findings, and should be treated as if they are in this
third, unplanned category unless compelling reasons are given to treat them
otherwise.
The nominal alpha level at which secondary outcomes are tested should be
at most the original level. One advantage of using the original level is that if
some secondary outcomes are very highly correlated with the original ones,
there is little loss of power in using that level, assuming appropriate
consideration is given to the correlations within the secondary class.
Subgroups may be based on prespecified categories of special interest
(e.g. sex, age, race), or on examination of outcomes (see 7d below).
Examples: We need examples for most points. Especially important are
examples of this third group, where there are very extreme p-values but
subsequent research has not supported the result. On effect sizes: these
are often not reported. If they are, we can include large apparent effect
sizes that have not been replicated. Since these analyses are often based
on very small sample sizes, effect sizes may be even more unreliable than
p-values.
Specific issues: I left these since I think they might be reasonable
additional points. I would add to them: Use raw data rather
than aggregated data as much as possible; binning data (putting them
into intervals) can change apparent relationships (Wainer reference).
Also, I mentioned Lord's paradox below. In view of the Wainer-Brown
paper, we might also mention Simpson's and Kelley's paradoxes. These
don't affect methodology as much as causing problems of interpretation and
explanation.
1. No real difference between allowing for pre-specified outcomes and
allowing for discovered outcomes: Multiplicity issues affect both. The
difference lies in the difficulty of specifying the family of inferences in the
latter case.
One may notice something that had not been anticipated in any way. It
might be possible to estimate a family size retrospectively. For example, in
a multifactor analysis of variance design, a two-way
interaction may unexpectedly turn out to be unusually large; this might be
evaluated using the family of all main effects and two-way interactions.
However, such retrospective consideration is highly suspect, since it is
unclear what other effects might have been considered.
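For concreteness, the retrospective family in the example above can be
counted explicitly. The sketch below assumes a three-factor design with
invented factor names:

```python
from itertools import combinations

# Retrospective family for a three-factor ANOVA (factor names hypothetical):
# all main effects plus all two-way interactions, as suggested above.
factors = ["A", "B", "C"]
main_effects = list(factors)
two_way = ["x".join(pair) for pair in combinations(factors, 2)]
family = main_effects + two_way            # 3 + 3 = 6 effects in total

# A Bonferroni threshold for an unexpectedly large interaction, judged
# against this retrospectively specified family.
alpha = 0.05
per_test = alpha / len(family)             # 0.05 / 6, roughly 0.0083
```

As the text cautions, the difficulty is that nothing guarantees this is the
family the analyst would actually have searched; the count of six is only as
credible as that retrospective specification.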
2. Pre-specify a small number (perhaps only one) of important outcomes
and allow maximum Type I error (usually set at .05) among these.
3. Allow some maximum Type I error (possibly the same value as in 2) for
all other analyses that have been planned for. Specification of the family
size and the analytic methods is crucial here. Control other than familywise error
control (e.g. FDR) may be considered. Consider Bayesian analysis if
appropriate. This family may include some analyses often called
exploratory, such as all pairwise comparisons (Tukey methods) or all
contrasts (Scheffe methods) among subgroups.
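A minimal sketch of the FDR alternative mentioned in point 3, implementing
the Benjamini-Hochberg step-up procedure (the p-values are invented):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg step-up
    procedure, which controls the false discovery rate at level q (under
    independence or positive dependence of the tests)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: find the largest rank with p <= rank * q / m.
        if p_values[i] <= rank * q / m:
            k_max = rank
    return sorted(order[:k_max])

# Ten hypothetical secondary p-values:
ps = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(ps, q=0.05)
```

Here only the first two hypotheses are rejected: the third-smallest p-value,
0.039, exceeds its threshold 3(0.05)/10 = 0.015, and the step-up search finds
no larger rank that qualifies.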
4. Consider hierarchical analyses (not going to a further step unless a test
at an earlier step is significant) to gain power. In general, it is
not always appropriate to speak of "allocating" the error within the primary
and secondary groups, since even with familywise error control, overall
control is often achieved without dividing the total allowable error into
subsets (e.g. stepwise methods can't be looked at as allocation).
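One concrete hierarchical scheme of this kind is the fixed-sequence
procedure, sketched below. It also illustrates why "allocation" is the wrong
picture: every hypothesis is tested at the full level, yet familywise error
is still controlled. The ordering must be pre-specified; the p-values shown
are hypothetical.

```python
def fixed_sequence(p_values, alpha=0.05):
    """Fixed-sequence test: hypotheses are tested in their pre-specified
    order, each at the full level alpha; testing stops at the first
    non-rejection.  Familywise error is controlled at alpha without
    dividing the error budget across hypotheses."""
    rejected = []
    for i, p in enumerate(p_values):
        if p <= alpha:
            rejected.append(i)
        else:
            break
    return rejected

# Hypothetical ordered p-values: the third test fails, so the fourth is
# never examined even though its p-value is very small.
result = fixed_sequence([0.01, 0.04, 0.20, 0.001])
```

The price of the extra power is rigidity: a striking result placed late in
the sequence (here 0.001) is simply never tested.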
5. Results from 3 and 4 might be labelled promising but unreliable.
6. If any further exploration is carried out, label any findings purely
exploratory and highly speculative, with no good statistical evidence to
back them up, because the set of outcomes considered is usually impossible
to specify.
(My feeling is that if people are going to do this anyway, it would be
better if at least they label their findings in this way.)
7. Some more specific points:
(a) In interaction analyses, consider testing for qualitative rather than only
quantitative interaction. It is much less important to note differences among
subgroup outcomes if they are all either positive or negative than to discover
that outcomes differ in direction among subgroups.
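The distinction in (a) can be made concrete with a small sketch. Note that
this classifies point estimates only; a formal test of qualitative
interaction (e.g. the Gail-Simon test) would also account for sampling
error. The effect values are invented.

```python
def interaction_type(subgroup_effects):
    """Classify subgroup treatment-effect estimates: 'qualitative' if the
    effects differ in direction, 'quantitative' if they share a direction
    but differ in size, 'none' if they are all equal."""
    signs = {e > 0 for e in subgroup_effects if e != 0}
    if len(signs) == 2:
        return "qualitative"
    if len(set(subgroup_effects)) > 1:
        return "quantitative"
    return "none"

# Hypothetical subgroup effect estimates (e.g. risk differences by subgroup):
quant = interaction_type([0.10, 0.25, 0.18])   # all benefit; sizes differ
qual = interaction_type([0.10, -0.15, 0.20])   # direction reverses
```

The second case is the important one: a subgroup in which the treatment
direction reverses matters far more than unequal benefit sizes.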
(b) Use nonparametric methods (permutation, bootstrap) where possible to
avoid parametric assumptions unless there is support for the latter.
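A minimal permutation test for a mean difference, as one instance of (b);
the samples and permutation count are hypothetical:

```python
import random

def permutation_test(x, y, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in means between two
    samples, avoiding parametric assumptions as suggested in (b)."""
    rng = random.Random(seed)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign group labels and recompute the mean difference.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_x]) / n_x
                   - sum(pooled[n_x:]) / (len(pooled) - n_x))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical samples with a clear separation between groups:
p = permutation_test([5.1, 6.2, 5.8, 6.0], [2.1, 2.5, 1.9, 2.3], n_perm=2000)
```

With samples this small the permutation distribution is coarse (here only 70
distinct splits), which caps how small the p-value can get; that limitation
is itself worth reporting.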
(c) Covariance adjustment with unreliable or randomly-fluctuating
covariates may be misleading; see Lord's paradox and the related Kelley's
paradox (Wainer reference).
(d) In analyses under (3) or (6), consider starting with outcome groups and
finding best predictors via regression analysis or some type of splitting
method (e.g. CART), with attention to error control and avoidance of
overfitting.
(e) In an analysis of variance design, steps 3 or 4 might be accomplished
using Tukey or Scheffe type analyses, as noted, or an intermediate approach
allowing all unweighted combinations of averages of one set of subgroup
means minus averages of another set
(Klockars and Hancock, 1998).
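The family used by such an intermediate procedure, all contrasts of the
average of one set of subgroup means against the average of a disjoint set,
can be enumerated directly. A sketch with invented subgroup means:

```python
from itertools import combinations

def average_contrasts(means):
    """Enumerate all contrasts of the form (average of one non-empty set of
    subgroup means) minus (average of a disjoint non-empty set), the family
    of unweighted combinations described above.  Returns
    (left_indices, right_indices, contrast_value) triples."""
    idx = range(len(means))
    out = []
    for r1 in range(1, len(means)):
        for left in combinations(idx, r1):
            rest = [i for i in idx if i not in left]
            for r2 in range(1, len(rest) + 1):
                for right in combinations(rest, r2):
                    if left < right:   # list each unordered pair only once
                        l_avg = sum(means[i] for i in left) / len(left)
                        r_avg = sum(means[i] for i in right) / len(right)
                        out.append((left, right, l_avg - r_avg))
    return out

# Three hypothetical subgroup means:
contrasts = average_contrasts([10.0, 12.0, 17.0])
```

With three subgroups the family has six members: the three pairwise
differences plus the three "one versus the average of the other two"
contrasts. This family is larger than Tukey's (all pairs) but smaller than
Scheffe's (all contrasts), which is where the intermediate power comes from.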
(f) Consider testing for differences in distributions of subgroups other than
mean differences. Choice of tests and interpretation of results depend on
the purpose of testing and the expected differences among subgroups.
Finally, we should realize that even with the most carefully designed studies,
we must expect that some results will not be replicated. Some results are
inevitably going to be false positives, and the proportion of these depends
on the proportion of studies in which there is really no effect. In the worst
case, suppose virtually all the treatments being evaluated in studies are
ineffective.
Then virtually all the positive findings will be Type I errors. In independent
replication of these studies, only a proportion alpha will be expected to show
significant results. Things are surely not that bad, and designed studies are
usually carried out only when there is some independent reason to think that
they are dealing with real effects, but the sobering fact is that we can't
expect the results of any one study to be definitive, and there is an unknown
limit to the number of studies we can expect to be supported in replications.
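The closing arithmetic can be made explicit: with significance level alpha,
a given power against real effects, and a proportion of truly effective
treatments under study, the expected fraction of positive findings that are
false positives follows from Bayes' rule. The priors below are illustrative,
not estimates.

```python
def false_positive_fraction(prior_effective, alpha=0.05, power=0.8):
    """Expected fraction of 'significant' findings that are Type I errors,
    given the proportion of studied treatments that are truly effective.
    If almost no treatment works, almost every positive finding is false."""
    true_pos = prior_effective * power          # truly effective and detected
    false_pos = (1 - prior_effective) * alpha   # ineffective but "significant"
    return false_pos / (true_pos + false_pos)

# If only 1 in 100 treatments studied is effective, most positives are false:
mostly_null = false_positive_fraction(0.01)     # roughly 0.86
# If half are effective, false positives are a small minority:
balanced = false_positive_fraction(0.50)        # roughly 0.06
```

This is why the base rate of real effects among studies, not just the
per-study alpha, limits how many published findings we can expect to survive
replication.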
REFERENCES
PubMed search, Monday 17 July 2006: “subgroup analysis”; Field: Title/Abstract;
Limits: English, Review, Male, Female, Humans, Core clinical journals.
1: Pocock SJ, Assmann SE, Enos LE, Kasten LE.
Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current
practice and problems.
Stat Med. 2002 Oct 15;21(19):2917-30.
2: Assmann SF, Pocock SJ, Enos LE, Kasten LE.
Subgroup analysis and other (mis)uses of baseline data in clinical trials.
Lancet. 2000 Mar 25;355(9209):1064-9.
3: Higashida RT, Furlan AJ, Roberts H, Tomsick T, Connors B, Barr J, Dillon W,
Warach S, Broderick J, Tilley B, Sacks D; Technology Assessment Committee of the
American Society of Interventional and Therapeutic Neuroradiology; Technology
Assessment Committee of the Society of Interventional Radiology.
Trial design and reporting standards for intra-arterial cerebral thrombolysis for acute ischemic stroke.
Stroke. 2003 Aug;34(8):e109-37. Epub 2003 Jul 17. Erratum in: Stroke. 2003 Nov;34(11):2774.
4: Pocock SJ, Hughes MD, Lee RJ.
Statistical problems in the reporting of clinical trials. A survey of three medical journals.
N Engl J Med. 1987 Aug 13;317(7):426-32.
5: Moreira ED Jr, Stein Z, Susser E.
Reporting on methods of subgroup analysis in clinical trials: a survey of four scientific journals.
Braz J Med Biol Res. 2001 Nov;34(11):1441-6.
6: Schulz KF, Chalmers I, Grimes DA, Altman DG.
Assessing the quality of randomization from reports of controlled trials published in obstetrics and
gynecology journals.
JAMA. 1994 Jul 13;272(2):125-8.
7: Hernandez AV, Steyerberg EW, Habbema JD.
Covariate adjustment in randomized controlled trials with dichotomous outcomes increases statistical
power and reduces sample size requirements.
J Clin Epidemiol. 2004 May;57(5):454-60.
8: Canner PL.
Covariate adjustment of treatment effects in clinical trials.
Control Clin Trials. 1991 Jun;12(3):359-66.
9: Grant AM, Altman DG, Babiker AB, Campbell MK, Clemens FJ, Darbyshire JH,
Elbourne DR, McLeer SK, Parmar MK, Pocock SJ, Spiegelhalter DJ, Sydes MR, Walker
AE, Wallace SA; DAMOCLES study group.
Issues in data monitoring and interim analysis of trials.
Health Technol Assess. 2005 Mar;9(7):1-238, iii-iv. Review.
10: Bhandari M, Devereaux PJ, Li P, Mah D, Lim K, Schunemann HJ, Tornetta P 3rd.
Misuse of baseline comparison tests and subgroup analyses in surgical trials.
Clin Orthop Relat Res. 2006 Jun;447:247-51.
11: Tangen CM, Koch GG.
Non-parametric analysis of covariance for confirmatory randomized clinical trials to evaluate dose-response relationships.
Stat Med. 2001 Sep 15-30;20(17-18):2585-607.
12: Yusuf S, Wittes J, Probstfield J, Tyroler HA.
Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials.
JAMA. 1991 Jul 3;266(1):93-8.
13: Wang SJ, Hung HM.
Adaptive covariate adjustment in clinical trials.
J Biopharm Stat. 2005;15(4):605-11.
14: Rothwell PM.
Treating individuals 2. Subgroup analysis in randomised controlled trials: importance, indications, and
interpretation.
Lancet. 2005 Jan 8-14;365(9454):176-86.
15: Grouin JM, Coste M, Lewis J.
Subgroup analyses in randomized clinical trials: statistical and regulatory issues.
J Biopharm Stat. 2005;15(5):869-82.
16: Lesaffre E, Bogaerts K, Li X, Bluhmki E.
On the variability of covariate adjustment: experience with Koch's method for evaluating the absolute
difference in proportions in randomized clinical trials.
Control Clin Trials. 2002 Apr;23(2):127-42.
17: DeMets DL.
Statistical issues in interpreting clinical trials.
J Intern Med. 2004 May;255(5):529-37. Review.
18: Laopaiboon M.
Meta-analyses involving cluster randomization trials: a review of published literature in health care.
Stat Methods Med Res. 2003 Dec;12(6):515-30. Review.
19: Grundy SM, Cleeman JI, Merz CN, Brewer HB Jr, Clark LT, Hunninghake DB, Pasternak RC, Smith
SC Jr, Stone NJ; Coordinating Committee of the National Cholesterol Education Program.
Implications of recent clinical trials for the National Cholesterol Education Program Adult Treatment
Panel III Guidelines.
J Am Coll Cardiol. 2004 Aug 4;44(3):720-32. Review.
20: Brookes ST, Whitely E, Egger M, Smith GD, Mulheran PA, Peters TJ.
Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for
the interaction test.
J Clin Epidemiol. 2004 Mar;57(3):229-36.
Klockars AJ, Hancock GR. A more powerful post hoc comparison procedure in analysis of variance.
J Educ Behav Stat. 1998;23:279-289.