Chapter 2-13. Reporting Confidence Intervals vs P Values and Reporting Trends Toward Significance

It is becoming very common to use confidence intervals in place of p values, because so much more information is conveyed. Cummings and Rivara (2003) are avid proponents of this approach,

"We acknowledge that sometimes P values may serve a useful purpose,30 but we recommend that point estimates and confidence intervals be used in preference to P values in most instances."
------
30 Weinberg CR. It's time to rehabilitate the P-value. Epidemiology. 2001;12:288-290.

Borenstein (1997) is an excellent article on the subject and well worth reading. Reproducing Borenstein's Figure 1,

clear
input x low high
6 -30  0
6   0 40
5  -4  0
5   0  5
4 -25  5
4   5 35
3   1  5
3   5  9
2  10 40
2  40 63
1  36 40
1  40 44
end
#delimit ;
twoway rbar low high x , horizontal barwidth(.75)
    lcolor(black) color(yellow)
    xline(0, lcolor(black) lwidth(medthick))
    xtitle(" ") ytitle(" ") yscale(off)
    ylabel(6 "A" 5 "B" 4 "C" 3 "D" 2 "E" 1 "F" , angle(horizontal))
    ytick(none) xlabels(-70(10)70)
    text(-0.25 -35 "Favors Placebo") text(-0.25 35 "Favors Drug")
    text(6 -65 "A" , placement(center) size(large))
    text(5 -65 "B" , placement(center) size(large))
    text(4 -65 "C" , placement(center) size(large))
    text(3 -65 "D" , placement(center) size(large))
    text(2 -65 "E" , placement(center) size(large))
    text(1 -65 "F" , placement(center) size(large))
    scheme(s2mono)
    ;
#delimit cr

_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript]. University of Utah School of Medicine, 2010.

Chapter 2-13 (revision 16 May 2010) p. 1

[Figure: horizontal bars showing the six confidence intervals, labeled A-F, on an axis running from -70 ("Favors Placebo") to +70 ("Favors Drug")]

Borenstein (1997) provides the following interpretation of these six fictional studies,

"Study A: Zero difference (-30 to +30). We cannot rule out the possibility that the effect is nil.
Nor, however, can we rule out the possibility that the effect is large enough to be clinically useful or clinically harmful.

Study B: Zero difference (-4 to +4). We cannot rule out the possibility that the effect is nil. More to the point, however, it is clear the effect is trivial (at best).

Study C: Five point difference (-25 to +35). The effect size in the sample (and our best estimate of the population effect) is 5 points. We cannot rule out the possibility that the effect is nil (or even harmful), nor can we rule out the possibility that it is quite potent.

Study D: Five point difference (1 to 9). The effect is probably not nil. More to the point, however, we can assert with a high level of certainty that the effect is not clinically important (at best a 9-point advantage in favor of the drug).

Study E: 40 point difference (10 to 63). The effect is probably not nil. The possible magnitude of the effect ranges from small to extremely potent (the interval is intentionally asymmetric).

Study F: 40 point difference (36 to 44). The drug is quite potent. The likely range of effects falls entirely within the "very potent" range."

Most authors use a combination of p values and confidence intervals in their papers. Even though more information is conveyed with confidence intervals, it is difficult to judge an interval whose bound just covers the null value. For example, if you are given the interval (-2, 10), it is difficult to assess whether the overlap to the left of 0 is of concern. Just throwing some numbers out, without the support of a reference: knowing the p value is 0.051 up to perhaps 0.10 helps the reader understand that the overlap is of little consequence. Similarly, knowing the p value is 0.15 or greater helps the reader realize that the overlap suggests a null effect. If the sample size is small, a p value between 0.10 and 0.15 might still suggest an effect that could be clarified with a larger sample size.
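When only a point estimate and a symmetric 95% CI are reported, an approximate two-sided p value can be back-calculated from the interval width using a normal approximation: the standard error is the half-width divided by 1.96, and the p value follows from the z statistic. A minimal sketch (Python is used here for illustration only, since the chapter's examples are in Stata; the helper name ci_to_p is ours, not from the chapter):

```python
import math

def ci_to_p(lower, upper, null_value=0.0):
    """Approximate two-sided p value recovered from a 95% CI,
    assuming a symmetric, normal-based interval."""
    estimate = (lower + upper) / 2.0
    se = (upper - lower) / (2 * 1.96)   # half-width divided by 1.96
    z = abs(estimate - null_value) / se
    # standard normal CDF via math.erf (no SciPy needed)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# The interval (-2, 10) discussed above:
print(round(ci_to_p(-2, 10), 2))  # 0.19
```

For Borenstein's Study D interval (1 to 9), the same back-calculation gives p of roughly 0.014, consistent with "the effect is probably not nil."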
Motzer et al (NEJM, 2007) use both p values and confidence intervals. This paper has a very impressive figure containing confidence intervals, in a format that is becoming increasingly popular. The figure presents a subgroup analysis expressed as confidence intervals, which is described in the text (p. 119),

"We analyzed the influence of baseline clinical features and previously identified prognostic factors20 on the treatment effect with the use of a Cox proportional-hazards model, controlling for each factor at a time. The benefit of sunitinib over interferon alfa was observed across all subgroups of patients (Fig. 3)."

Exercise. Look at Motzer's figure 3. Notice how natural it is to interpret the CIs in a Borenstein fashion.

Gardner and Altman (1986) remind us that the actual goal of medical research is not to test the null hypothesis, but rather to determine the magnitude of some factor of interest,

"Over the past two or three decades the use of statistics in medical journals has increased tremendously. One unfortunate consequence has been a shift in emphasis away from the basic results towards an undue concentration on hypothesis testing. In this approach data are examined in relation to a statistical 'null' hypothesis, and the practice has led to the mistaken belief that studies should aim at obtaining 'statistical significance.' On the contrary, the purpose of most research investigations in medicine is to determine the magnitude of some factor(s) of interest. For example, a laboratory based study may investigate the difference in mean concentrations of a blood constituent between patients with and without a certain illness, while a clinical study may assess the difference in prognosis of patients with a particular disease treated by alternative regimens in terms of rates of cure, remission, relapse, survival, etc.
The difference obtained in such a study will be only an estimate of what we really need, which is the result that would have been obtained had all the eligible subjects (the "population") been investigated rather than just a sample of them. What authors and readers should want to know is by how much the illness modified the mean blood concentrations or by how much the new treatment altered the prognosis, rather than only the level of statistical significance.

The excessive use of hypothesis testing at the expense of other ways of assessing results has reached such a degree that levels of significance are often quoted alone in the main text and abstracts of papers, with no mention of actual concentrations, proportions, etc, or their differences. The implication of hypothesis testing—that there can always be a simple 'yes' or 'no' answer as the fundamental result from a medical study—is clearly false and used in this way hypothesis testing is of limited value.2"
-------------
2 Altman DG, Gore SM, Gardner MJ, Pocock SJ. Statistical guidelines for contributors to medical journals. Br Med J 1983;286:1489-93.

Making the argument of a significant effect when the p value is not significant

<< this section is under construction: for now it is just a series of quotes that I will eventually develop into a coherent argument >>

Royall (1997, p.62) quotes Burdette and Gehan's (1970, p.9) convention for interpreting the strength of evidence offered by the p value against the null hypothesis:

"Reasonable interpretations of the results of significance tests are as follows:

  Significance Level of Data                   Interpretation
  Less than 1 per cent                         Very strong evidence against the null hypothesis
  1 per cent to 5 per cent                     Moderate evidence against the null hypothesis
  More than 5 per cent and less than 10        Suggestive evidence against the null hypothesis
    per cent
  10 per cent or more                          Little or no real evidence against the null hypothesis"

Lang and Secic (2006, p.58) state,

"For differences that are clinically important but not statistically significant, do not report a 'trend toward significance.' Instead, report the observed difference and the (95%) confidence interval for the difference. When authors find a clinically important difference that is not statistically significant, they sometimes report that the difference shows a 'trend' (it cannot; it could just as easily move 'away from' the alpha level as 'toward' it). The point is that clinically important results should not be overlooked because they are not statistically significant (12)."
_____
12. Gardner MJ, Altman D. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ 1986;292:746-50.

Altman, Gore, et al (1983, section 4.3) state,

"Calling any value with p>0.05 'not significant' is not recommended, as it may obscure results that are not quite statistically significant but do suggest a real effect (see section 5.1)."

and clarify further (1983, section 5.1) with,

"Some flexibility is desirable in interpreting p values.
The 0.05 level is a convenient cut off point, but p values of 0.04 and 0.06, which are not greatly different, ought to lead to similar interpretations, rather than radically different ones. The designation of any result with p > 0.05 as not significant may thus mislead the reader (and the authors); hence the suggestion in section 4.3 to quote actual p values."

Kenneth Rothman (2002, pp.113-129) presents a nice discussion of p values, p value functions, and confidence intervals (CI). He points out that a p value is computed at the null value (0 for a difference, 1 for a ratio) rather than at the observed effect. If one computed a p value for every possible value of the effect size, the effect size with p=1.0 would be the one most supported by the data. The collection of all of these possible p values is the "p value function." A CI is a similar type of presentation: the center of the CI is one's best estimate of the effect size, and the limits, or endpoints, of the CI define the gray zone. Although Rothman does not state it in a simple quotable form, he presents the idea that a 95% CI that just slightly crosses the null value (so that the p value is just slightly > 0.05) provides much more evidence for the observed value than it does for a chance occurrence, or the null value.

Example of reporting a p value between 0.05 and 0.10

In The New England Journal of Medicine, it is popular for authors to include a statement like one of the following in the Statistical Analysis section:

1) A two-sided P value of less than 0.05 was considered to indicate statistical significance, or
2) All P values were calculated with two-sided tests, and no correction was made for multiple testing, or
3) to say nothing about P values at all.

Karunajeewa et al (N Engl J Med, 2008) reported several significant findings (p<0.05), but they had a few p values between 0.05 and 0.10 and referred to them as trends.
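Before looking at that example, note that Rothman's p value function is easy to sketch numerically: test every candidate effect size against the observed estimate, assuming an approximately normal sampling distribution, and the curve peaks at p = 1.0 at the point estimate itself. The sketch below is a hypothetical illustration in Python, not Rothman's own code, and the estimate and standard error are made-up numbers:

```python
import math

def p_value_function(estimate, se, candidates):
    """Two-sided p value for each candidate effect size, treating the
    observed estimate as normally distributed with the given SE."""
    def two_sided_p(candidate):
        z = abs(estimate - candidate) / se
        return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
    return {c: two_sided_p(c) for c in candidates}

# Observed difference of 4 with SE 3: the p value is largest (1.0)
# at the point estimate, the effect size most supported by the data.
pf = p_value_function(4.0, 3.0, [0, 2, 4, 6, 8])
print(pf[4])  # 1.0
```

The function is symmetric about the estimate, so candidates equally far below and above it (here 0 and 8) receive the same p value; the 95% CI is just the range of candidates with p > 0.05.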
They reported (page 2551),

"Among children who had falciparum malaria, in univariate analyses, there was a trend toward a lower risk of any treatment failure (not corrected through PCR genotyping) at day 7 with a higher plasma piperaquine level in the dihydroartemisinin-piperaquine group (hazard ratio for each increase of 10 µg per liter, 0.86; 95% CI, 0.73 to 1.01; P=0.06) and with a higher plasma lumefantrine level in the artemether-lumefantrine group (hazard ratio for each increase of 100 µg per liter, 0.87; 95% CI, 0.74 to 1.02; P=0.09). According to the Cox model of treatment failure in the dihydroartemisinin-piperaquine group, after correction through PCR genotyping, there was a trend toward an association with plasma piperaquine levels at day 7 (P=0.08), but the nutrition z score according to weight for age was no longer significantly associated (P=0.25)."

In their Statistical Methods section, they avoided the "P<0.05 as significant" phrase and stated, "All P values are two-tailed and were not adjusted for multiple comparisons."

References

Borenstein M. (1997). Hypothesis testing and effect size estimation in clinical trials. Ann Allergy Asthma Immunol 78:5-16.

Burdette WJ, Gehan EA. (1970). Planning and Analysis of Clinical Studies. Springfield, IL: Charles C. Thomas.

Cummings P, Rivara FP. (2003). Reporting statistical information in medical journal articles. Arch Pediatr Adolesc Med 157:321-324.

Gardner MJ, Altman D. (1986). Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ 292:746-50.

Karunajeewa HA, Mueller I, Senn M, et al. (2008). A trial of combination antimalarial therapies in children from Papua New Guinea. N Engl J Med 359:2545-57.

Lang TA, Secic M. (2006). How to Report Statistics in Medicine: Annotated Guidelines for Authors, Editors, and Reviewers. 2nd ed. Philadelphia: American College of Physicians.

Motzer RJ, Hutson TE, Tomczak P, et al. (2007).
Sunitinib versus interferon alfa in metastatic renal-cell carcinoma. N Engl J Med 356(2):115-24.

Rothman KJ. (2002). Epidemiology: An Introduction. New York: Oxford University Press.

Royall R. (1997). Statistical Evidence: A Likelihood Paradigm. New York: Chapman & Hall/CRC.