Chapter 2-13. Reporting Confidence Intervals vs P Values and Reporting Trends Toward Significance
It is becoming very common to use confidence intervals in place of p values, because they
convey so much more information.
Cummings and Rivara (2003) are avid proponents of this approach,
“We acknowledge that sometimes P values may serve a useful purpose,30 but we
recommend that point estimates and confidence intervals be used in preference to P
values in most instances.”
_____
30. Weinberg CR. It’s time to rehabilitate the P-value. Epidemiology 2001;12:288-290.
Borenstein (1997) is an excellent article on the subject and well worth reading.
Reproducing Borenstein’s Figure 1,
clear
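* Two rows per study: each confidence interval is entered as two bars
* that meet at the point estimate, so the junction of the bars marks
* the point estimate.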
input x low high
6 -30 0
6 0 30
5 -4 0
5 0 4
4 -25 5
4 5 35
3 1 5
3 5 9
2 10 40
2 40 63
1 36 40
1 40 44
end
#delimit ;
twoway rbar low high x
, horizontal barwidth(.75) lcolor(black) color(yellow)
xline(0, lcolor(black) lwidth(medthick))
xtitle(" ") ytitle(" ") yscale(off)
ylabel(6 "A" 5 "B" 4 "C" 3 "D" 2 "E" 1 "F" , angle(horizontal))
ytick(none) xlabels(-70(10)70)
text(-0.25 -35 "Favors Placebo") text(-0.25 35 "Favors Drug")
text(6 -65 "A" , placement(center) size(large))
text(5 -65 "B" , placement(center) size(large))
text(4 -65 "C" , placement(center) size(large))
text(3 -65 "D" , placement(center) size(large))
text(2 -65 "E" , placement(center) size(large))
text(1 -65 "F" , placement(center) size(large))
scheme(s2mono)
;
#delimit cr
_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2010.
[Resulting figure: six horizontal confidence-interval bars labeled A through F on an axis running
from -70 to +70, with “Favors Placebo” to the left of zero and “Favors Drug” to the right.]
Borenstein (1997) provides the following interpretation of these six fictional studies,
“Study A: Zero difference (-30 to +30). We cannot rule out the possibility that the effect
is nil. Nor, however, can we rule out the possibility that the effect is large enough to be
clinically useful or clinically harmful.
Study B: Zero difference (-4 to +4). We cannot rule out the possibility that the effect is
nil. More to the point, however, it is clear the effect is trivial (at best).
Study C: Five point difference (-25 to +35). The effect size in the sample (and our best
estimate of the population effect) is 5 points. We cannot rule out the possibility that the
effect is nil (or even harmful) nor can we rule out the possibility that it is quite potent.
Study D: Five point difference (1 to 9). The effect is probably not nil. More to the point,
however, we can assert with a high level of certainty that the effect is not clinically
important (at best a 9-point advantage in favor of the drug).
Study E: 40 point difference (10 to 63). The effect is probably not nil. The possible
magnitude of the effect ranges from small to extremely potent (the interval is
intentionally asymmetric).
Study F: 40 point difference (36 to 44). The drug is quite potent. The likely range of
effects falls entirely within the “very potent” range.”
Most authors use a combination of p values and confidence intervals in their papers. Even though
more information is conveyed with confidence intervals, it is difficult to judge an interval bound
that just covers the null value. For example, if you are given the interval (-2 , 10), it is difficult to
assess whether the overlap to the left of 0 is of concern.
Just throwing some numbers out, without the support of a reference: knowing the p value is
between 0.051 and perhaps 0.10 helps the reader understand that the overlap is of little
consequence. Similarly, knowing the p value is 0.15 or greater helps the reader realize that the
overlap suggests a null effect. If the sample size is small, a p value between 0.10 and 0.15 might
still suggest an effect that could be clarified with a larger sample size.
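When only the confidence interval is reported, an approximate p value can be back-calculated,
which makes this sort of judgment concrete. Here is a minimal Stata sketch, assuming the point
estimate is the interval midpoint and the interval is a symmetric, normal-based 95% CI; the
(-2 , 10) interval is the hypothetical one from the paragraph above.

* Back-calculate an approximate two-sided p value from a 95% CI,
* assuming a symmetric normal-based interval.
scalar est = (-2 + 10)/2             // point estimate = 4
scalar se  = (10 - (-2))/(2*1.96)    // standard error = 3.06
display "two-sided P = " 2*normal(-abs(est/se))

This displays a P of about 0.19, which by the rough guideline above tells the reader that the
overlap suggests a null effect, something that is hard to see from the interval (-2 , 10) alone.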
Motzer et al (NEJM, 2007) use both p values and confidence intervals. This paper has a very
impressive figure containing confidence intervals, in a format that is becoming increasingly
popular. The figure presents a subgroup analysis expressed as confidence intervals, which is
referred to in the text (p. 119),
“We analyzed the influence of baseline clinical features and previously identified
prognostic factors20 on the treatment effect with the use of a Cox proportional-hazards
model, controlling for each factor at a time. The benefit of sunitinib over interferon alfa
was observed across all subgroups of patients (Fig. 3).”
Exercise. Look at Motzer’s figure 3. Notice how natural it is to interpret the CIs in a Borenstein
fashion.
Gardner and Altman (1986) remind us that the actual goal of medical research is not to test the
null hypothesis, but rather to determine the magnitude of some factor of interest,
“Over the past two or three decades the use of statistics in medical journals has
increased tremendously. One unfortunate consequence has been a shift in emphasis away
from the basic results towards an undue concentration on hypothesis testing. In this
approach data are examined in relation to a statistical ‘null’ hypothesis, and the practice
has led to the mistaken belief that studies should aim at obtaining ‘statistical
significance.’ On the contrary, the purpose of most research investigations in medicine is
to determine the magnitude of some factor(s) of interest.
For example, a laboratory based study may investigate the difference in mean
concentrations of a blood constituent between patients with and without a certain illness,
while a clinical study may assess the difference in prognosis of patients with a particular
disease treated by alternative regimens in terms of rates of cure, remission, relapse,
survival, etc. The difference obtained in such a study will be only an estimate of what we
really need, which is the result that would have been obtained had all the eligible subjects
(the “population”) been investigated rather than just a sample of them. What authors and
readers should want to know is by how much the illness modified the mean blood
concentrations or by how much the new treatment altered the prognosis, rather than only
the level of statistical significance.
The excessive use of hypothesis testing at the expense of other ways of assessing
results has reached such a degree that levels of significance are often quoted alone in the
main text and abstracts of papers, with no mention of actual concentrations, proportions,
etc, or their differences. The implication of hypothesis testing—that there can always be
a simple ‘yes’ or ‘no’ answer as the fundamental result from a medical study—is clearly
false, and used in this way hypothesis testing is of limited value.2”
_____
2. Altman DG, Gore SM, Gardner MJ, Pocock SJ. Statistical guidelines for contributors to
medical journals. Br Med J 1983;286:1489-93.
Making the argument of a significant effect when the p value is not significant
<< this section is under construction: for now it is just a series of quotes that I will
eventually develop into a coherent argument >>
Royall (1997, p.62) quotes Burdette and Gehan’s (1970, p.9) convention for interpreting the
strength of evidence offered by the p value against the null hypothesis:
“Reasonable interpretations of the results of significance tests are as follows:
Significance Level of Data                       Interpretation
Less than 1 per cent                             Very strong evidence against the null hypothesis
1 per cent to 5 per cent                         Moderate evidence against the null hypothesis
More than 5 per cent and less than 10 per cent   Suggestive evidence against the null hypothesis
10 per cent or more                              Little or no real evidence against the null hypothesis.”
Lang and Secic (2006, p.58) state,
“For differences that are clinically important but not statistically significant, do not report
a ‘trend toward significance.’ Instead, report the observed difference and the (95%)
confidence interval for the difference. When authors find a clinically important
difference that is not statistically significant, they sometimes report that the difference
shows a ‘trend’ (it cannot; it could just as easily move ‘away from’ the alpha level as
‘toward’ it). The point is that clinically important results should not be overlooked
because they are not statistically significant (12).”
_____
12. Gardner MJ, Altman D. Confidence intervals rather than P values: estimation rather than hypothesis
testing. BMJ 1986;292:746-50.
Altman, Gore, et al (1983, section 4.3) state,
“Calling any value with p>0.05 “not significant” is not recommended, as it may obscure
results that are not quite statistically significant but do suggest a real effect (see section
5.1).”
and clarify further (1983, section 5.1) with,
“Some flexibility is desirable in interpreting p values. The 0.05 level is a convenient cut
off point, but p values of 0.04 and 0.06, which are not greatly different, ought to lead to
similar interpretations, rather than radically different ones. The designation of any result
with p > 0.05 as not significant may thus mislead the reader (and the authors); hence the
suggestion in section 4.3 to quote actual p values.”
Kenneth Rothman (2002, pp.113-129) presents a nice discussion of p values, p value functions,
and confidence intervals (CI). He points out that a p value is computed assuming the null value
(0 for a difference, 1 for a ratio), rather than the observed effect. If one computed a p value for
every possible value of the effect size, treating each in turn as the hypothesized value, the effect
size with p = 1.0 would be the one most supported by the data. All of these possible p values
together make up the “p value function.” A CI is a similar type of presentation. The center of the
CI is one’s best estimate of the effect size, and the limits, or endpoints, of the CI define the gray
zone. Although Rothman does not state it in a simple quotable form, he presents the idea that a
95% CI that just slightly crosses the null value (so that the p value is just slightly greater than
0.05) provides much more evidence for the observed value than it does for the null value, or a
chance occurrence.
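Rothman’s p value function is easy to sketch in Stata. The following is an illustration with
assumed numbers, not Rothman’s own code: a hypothetical observed difference of 5 with a
standard error of 15, with each candidate effect size tested in turn under a normal approximation.

* P value function: two-sided p value testing H0: effect = delta,
* for every candidate effect size delta (normal approximation).
* The observed difference (5) and SE (15) are assumed for illustration.
clear
set obs 141
generate delta = -70 + (_n - 1)      // candidate effect sizes -70 to 70
generate z = (5 - delta)/15
generate p = 2*normal(-abs(z))       // p = 1.0 at delta = 5
#delimit ;
twoway line p delta
    , xline(0, lcolor(black)) yline(.05, lpattern(dash))
    xtitle("Hypothesized effect size") ytitle("Two-sided P value")
    scheme(s2mono)
    ;
#delimit cr

The curve peaks at p = 1.0 over the observed difference and falls away on both sides; the two
effect sizes where it crosses the dashed 0.05 line are the limits of the 95% confidence interval.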
Example of reporting a p value between 0.05 and 0.10
In The New England Journal of Medicine, it is popular for authors to include a statement like the
following in the Statistical Analysis section: 1) A two-sided P value of less than 0.05 was
considered to indicate statistical significance; or 2) All P values were calculated with two-sided
tests, and no correction was made for multiple testing; or 3) say nothing about P values at all.
Karunajeewa et al (N Engl J Med, 2008) reported several significant findings (p<0.05), but they
had a few p values between 0.05 and 0.10 and referred to them as trends. They reported (page
2551),
“Among children who had falciparum malaria, in univariate analyses, there was a trend
toward a lower risk of any treatment failure (not corrected through PCR genotyping) at
day 7 with a higher plasma piperaquine level in the dihydroartemisinin-piperaquine group
(hazard ratio for each increase of 10 µg per liter, 0.86; 95% CI, 0.73 to 1.01; P=0.06) and
with a higher plasma lumefantrine level in the artemether-lumefantrine group (hazard ratio
for each increase of 100 µg per liter, 0.87; 95% CI, 0.74 to 1.02; P=0.09). According to
the Cox model of treatment failure in the dihydroartemisinin-piperaquine group, after
correction through PCR genotyping, there was a trend toward association in plasma
piperaquine levels at day 7 (P=0.08), but the nutrition z score according to weight for age
was no longer significantly associated (P=0.25).”
In their Statistical Methods section, they avoided the “P<0.05 as significant” phrase and stated,
“All P values are two-tailed and were not adjusted for multiple comparisons.”
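The same back-calculation sketched earlier applies to these hazard ratios, working on the log
scale since ratio measures are analyzed as logs. Using the reported numbers from the quoted
passage (HR 0.86; 95% CI, 0.73 to 1.01), a rough check:

* Approximate two-sided p value from a reported hazard ratio and 95% CI,
* working on the log scale, where the interval is roughly symmetric.
scalar lnhr = ln(0.86)
scalar se   = (ln(1.01) - ln(0.73))/(2*1.96)
display "two-sided P = " 2*normal(-abs(lnhr/se))

This displays a P of about 0.07, close to the reported P = 0.06 (the difference reflects rounding
in the published interval), and it is easy to see why the authors described the result as a trend:
the interval barely crosses the null value of 1.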
References
Borenstein M. (1997). Hypothesis testing and effect size estimation in clinical trials. Ann
Allergy Asthma Immunol 78:5-16.
Burdette WJ, Gehan EA. (1970). Planning and Analysis of Clinical Studies, Springfield
IL: Charles C. Thomas.
Cummings P, Rivara FP. (2003). Reporting statistical information in medical journal articles.
Arch Pediatr Adolesc Med 157:321-324.
Gardner MJ, Altman D. (1986). Confidence intervals rather than P values: estimation rather than
hypothesis testing. BMJ 292:746-50.
Karunajeewa HA, Mueller I, Senn M, et al. (2008). A trial of combination antimalarial
therapies in children from Papua New Guinea. N Engl J Med 359:2545-57.
Lang TA, Secic M. (2006). How to Report Statistics in Medicine. Annotated Guidelines
for Authors, Editors, and Reviewers. 2nd ed. Philadelphia, American College of
Physicians.
Motzer RJ, Hutson TE, Tomczak P, et al. (2007). Sunitinib versus interferon alfa in metastatic
renal-cell carcinoma. N Engl J Med 356(2):115-24.
Rothman KJ. (2002). Epidemiology: An Introduction. New York, Oxford University
Press.
Royall R. (1997). Statistical Evidence: A Likelihood Paradigm, New York, Chapman &
Hall/CRC.