FUNDAMENTALS OF MEDICAL RESEARCH
Ed Gracely, Ph.D.
Family, Community, and Preventive Medicine
Hypothesis testing and Confidence Intervals
July 15, 2014
Goals. Attendees should be able to:
1) Explain the basic logic of hypothesis testing.
2) Define and explain "p".
3) Define and distinguish type I and type II errors, and define alpha.
4) Define and distinguish statistical and clinical significance.
5) State and explain some concerns and issues with multiple comparisons and analyses.
6) Define, explain, and interpret confidence intervals.
7) Explain how confidence intervals are used to test statistical hypotheses.
A. Hypothesis Testing
Ex: [Graham DY et al, Annals of Internal Medicine. 116(9):705-8, 1992 May 1].
These authors assigned 89 patients with recently healed duodenal ulcers and proven H. pylori
infection to either an antacid alone or the same antacid plus antibiotics. Patients were
followed for up to 2 years. Recurrence of ulcer occurred in 95% of the antacid-alone patients
but in only 12% of the combined-therapy group.
If there were no benefit to treatment, the two samples would differ by this much or more
in fewer than one experiment out of 1,000 (by chance alone). You do not need to be able
to calculate this!
Ignoring bad design and fraud, there are 2 possible explanations:
1) Real difference between treatments.
2) This is one of those fewer-than-1-in-1,000 experiments with such a large
difference by chance.
Most people would pick "real difference"!
The probability of getting a difference so large, if there is no population (real)
difference, is the p-value so often reported in the literature.
Here, p < 1/1,000; that is, p < 0.001.
Statistically significant: said of a result for which you conclude in favor of a
difference because of a small p value.
In other words, if the addition of antibiotics actually has no special benefit, we would
expect to find no difference in the sample. We realize, however, that actual sample results
will randomly vary from this expectation. If many studies were done, all with two
equally-effective treatments, 1 study in 1,000 would have results as good as (or better
than) the example study purely by chance. So results like these COULD occur by chance,
but are very unlikely.
The cutoff for significance is symbolized α (alpha). Normally the cutoff for a significant
p is set to 0.05. It can be smaller than 0.05 if you have reason to be stricter (for example
if there are many comparisons in a study).
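To make the logic concrete, here is a minimal sketch (in Python, using scipy) of how a p-value for a 2 x 2 result like the ulcer example could be computed. The handout does not give the group sizes, so a roughly even split of the 89 patients is assumed purely for illustration:

    from scipy.stats import fisher_exact

    # Rows: recurrence yes / no. Group sizes (44 and 45) are assumed, not
    # taken from the paper; counts chosen to match ~95% and ~12% recurrence.
    table = [[42, 2],    # antacid alone: 42 of 44 recurred (~95%)
             [5, 40]]    # antacid + antibiotics: 5 of 45 recurred (~11%)
    odds_ratio, p_value = fisher_exact(table)
    print(p_value)       # well below 0.001, consistent with the quoted p < 0.001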
Other examples: Each contains a quote and an interpretation.
Ex 2: "The mean score on the Crohn's Disease activity index after 16 weeks of
treatment was significantly lower in the methotrexate group (162) than in the
placebo group (204), p = 0.002", Feagan, BG et al. NEJM, Feb 2, 1995
If the treatment really makes no difference, (that is, the population mean difference
= 0) then the probability of a mean difference of 42 or more (i.e., 204 - 162) in the
samples would be 0.002.
Ex 3: “OBJECTIVE: To evaluate the effectiveness of a health-visitor-led intervention
for failure to thrive in children under 2 years old…. When the children were last
weighed, 91 (76%) in the intervention group had recovered from their failure to
thrive compared with 60 (55%) in the control group (P<0.001).” BMJ.
317(7158):571-4, 1998 Aug 29
So, if the intervention truly made no difference, the probability of a difference this
great (76% versus 55%) or more in recovery rate by chance alone is less than
0.001.
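As a rough check on this p-value, the group sizes can be inferred from the reported counts and percentages (91/0.76 ≈ 120 and 60/0.55 ≈ 109); the sketch below treats those denominators as approximate:

    from scipy.stats import chi2_contingency

    # Rows: recovered / not recovered. Denominators (~120 and ~109) are
    # inferred from the reported counts and percentages, so approximate.
    table = [[91, 120 - 91],   # intervention group
             [60, 109 - 60]]   # control group
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(p_value)             # roughly 0.001, in line with "P < 0.001"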
Q: Which of examples 2 and 3 are statistically significant, using alpha = 0.05? How
about if the researcher had used alpha = 0.001?
Clinically significant: A result big enough to influence how you treat or advise
patients. Not all statistically significant results are clinically significant.
Type I and type II errors:
In clinical testing it is useful to know how likely it is that a test will produce a false positive
result (healthy patient indicated to have disease) and a false negative result (sick patient
not recognized as such by the test). We use similar concepts with statistical analyses.
Sometimes there truly is no benefit to a new treatment (or no difference between treatments)
but the data indicate that there is (with p < 0.05) entirely by chance. This is a kind of false
positive, and we call it a "Type I error". The probability of a type I error is the alpha value
we discussed above.
In other cases (perhaps more commonly) there really is a difference, but our study fails to
find it, again by chance. This is a type II error, a kind of false negative. Type II errors and
power (a related concept) are the subject of a later class in this series.
Ex: OK, so patient Jones has a fasting blood glucose of 130. Is Jones diabetic or could this
be a false positive? You can't tell from one test -- only a replication or additional data
(glucose tolerance test, say) will answer the question.
Similarly in statistics. You get p = 0.04. Is that a real difference or a type I error? Only a
replication or additional data will answer the question. We'll assume for now that it's real,
but replication and confirmation are the lifeblood of science.
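Type I error can also be seen directly by simulation: draw two samples from the same population many times and count how often p < 0.05 appears anyway. A minimal sketch (Python with numpy/scipy; all numbers invented):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    false_pos, runs = 0, 2000
    for _ in range(runs):
        a = rng.normal(0, 1, 30)
        b = rng.normal(0, 1, 30)   # same distribution: no true difference
        if stats.ttest_ind(a, b).pvalue < 0.05:
            false_pos += 1
    print(false_pos / runs)        # close to 0.05, the alpha level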
The main use for these two error terms is in critiquing studies.
Q: Fill in the blanks. From an actual review I did a year or two ago (modified to
conceal details): "… out of many analyses, you report and discuss one result with
p = 0.02 and a few others. This raises the issue of ________ error (false positive
error) unless you had anticipated this as the effect of most interest, which does not
appear to be the case. So you need to clearly consider chance as a possible
explanation for the findings."
Q: (Invented review): "You argue that your new treatment is just as good as the old
one, because you found no difference between them. But you only had 10 subjects
in each group. I think the possibility of ________ error must be dealt with."
B. Multiple analyses:
Type I error becomes a major issue when a study reports many analyses, especially if only a
few of them are significant.
Rule 1: In a study with numerous p-values, it is likely that an occasional one will be significant
by chance, even if there is no truly interesting difference or association involved. Be wary
of studies that seem to engage in "data fishing", i.e., working very hard to find a few
gems of statistical significance in a pile of non-significant findings. "Thus," says the
article, "we randomized cancer patients to treatment and control. Overall there were no
differences on rate of recurrence or death (p's > 0.05) between the two groups, but the
treatment did reduce mortality (p = 0.03 compared to control) in elderly patients with the
highest tumor grade. Future studies to investigate the benefits of this therapy in high risk
older patients may be warranted". Well, maybe. Or maybe not.
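The arithmetic behind Rule 1 is simple: with k independent tests at alpha = 0.05 and no true effects anywhere, the chance of at least one "significant" result is 1 - 0.95^k. A quick check:

    # Chance of at least one false positive among k independent tests at 0.05
    for k in (1, 5, 10, 20):
        print(k, round(1 - 0.95 ** k, 2))   # 0.05, 0.23, 0.40, 0.64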
Rule 2: The concern about type I error is greatly reduced if
• many of the results are significant (since it is quite unlikely they are all type I
errors), or
• the few significant results have strong p-values, like 0.01 or smaller, or
• a few critical comparisons had been indicated as primary in advance, and these are
significant. It is good research methodology to indicate your primary hypotheses
and analyses in advance.
Far too many studies have non-replicable results, sometimes because of playing with the
data. Non-significant analyses are not reported. Only the most interesting comparisons
are included. Composite outcome variables for comparison are crafted after the data is
reviewed. All of this is very bad practice. Research should be "scripted" in advance.
How are you going to analyze the data? What will be reported? Etc…
C. Confidence Intervals
Confidence intervals: intervals computed around a sample statistic that are intended
to be highly likely to contain the true, or "population", value of that statistic.
Commonly 95% confidence intervals are used. What we did with the SEM is
a rough confidence interval.
Remember: a sample statistic is just an estimate for the true value. Thus a sample mean
is not necessarily the mean in the population of patients like those seen. The sample
mean estimates that population mean, but will differ from it randomly even in a well-done study.
The same is true for other parameters, like relative risks and percentages.
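To make this concrete, here is a minimal sketch of a 95% confidence interval for a mean, built from the sample mean and its standard error (the data values are invented, not from the studies below):

    import numpy as np
    from scipy import stats

    # Invented illustration data; not from any study cited in this handout.
    x = np.array([55.2, 61.0, 58.4, 63.1, 60.0, 66.3])
    m, sem = x.mean(), stats.sem(x)
    lo, hi = stats.t.interval(0.95, len(x) - 1, loc=m, scale=sem)  # df = n - 1
    print(m, (lo, hi))   # sample mean with its 95% confidence interval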
Ex: Free walking speed in a sample of patients with a leg amputation was found to have a
mean of 60.6 m/min (95% CI: 52.6 - 68.6). [Powers et al, PT April 1996]. This indicates
that the sample mean FWS was 60.6, and that the researcher is 95% sure that the true
(population) mean FWS for similar patients is between 52.6 and 68.6.
Ex: Hogben et al, Obstetrics and Gynecology Oct 2002 investigated in a survey the kinds
of screening different types of doctors did for STIs. One result was that among the 647
responding Ob/Gyns, 55% screened non-pregnant women for Chlamydia (95%
confidence interval 51% to 59%).
Q: The statistic of interest here is not a mean; it is the percentage of Ob/Gyns who
screen. 55% screened non-pregnant women for Chlamydia. This is a (sample //
population) statistic. Based on the 95% confidence interval, we are 95% sure that
the (sample // population) screening rate by Ob/Gyns for Chlamydia is between
________ and ________. [a]
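You can reproduce the Hogben interval with the usual normal approximation for a proportion, p ± 1.96 × √(p(1 − p)/n):

    import math

    p, n = 0.55, 647                      # screening rate and number of responders
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    print(p - half, p + half)             # ~0.512 to 0.588: the reported 51% to 59%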
INTERPRETING CI FOR RELATIVE RISKS AND ODDS RATIOS
Relative risks and odds ratios are often presented with 95% CI as a critical part of their
interpretation.
First a few questions to be sure you are with me on relative risks:
Q: A researcher might discover that the relative risk for poison ivy was 25 in people who
walk through the woods a lot compared to controls who rarely go there. This means
that: [b]
a. 25% of people who walk through the woods get poison ivy.
b. There were 25 people in the study who walked in the woods.
c. Poison ivy was 25 times as frequent in people who walked in the woods as in
controls.
d. Picking up those “shiny, pretty leaves” is fun and useful (Hint: wrong answer).
Q: Based on a single study, the relative risk of 25 is best regarded as a (SAMPLE //
POPULATION) value. [c]
Confidence intervals for many statistics are interpreted in one of two ways:
• In many cases, researchers are interested in the largest or smallest plausible
value for the population statistic. A confidence interval can help determine
whether certain values or ranges of values can be "ruled out" or "ruled in" with
high probability.
• Often the most important question is whether the value that would represent
no effect or no association can be ruled out. If it can be, then the confidence
interval indicates statistical significance even without a p-value.
The "no difference" value for a relative risk or odds ratio is 1.0. If 1 is not in the
95% confidence interval, you may infer p < 0.05 and find for a difference.
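For those curious how such intervals arise, a 95% CI for a relative risk is typically computed on the log scale. The sketch below uses invented counts, not data from any study cited here:

    import math

    # Invented 2x2 counts: events / totals in exposed and unexposed groups.
    a, n1 = 30, 100   # exposed: 30 events in 100 (hypothetical)
    b, n2 = 10, 100   # unexposed: 10 events in 100 (hypothetical)
    rr = (a / n1) / (b / n2)
    se = math.sqrt(1/a - 1/n1 + 1/b - 1/n2)           # SE of log(RR)
    lo = math.exp(math.log(rr) - 1.96 * se)
    hi = math.exp(math.log(rr) + 1.96 * se)
    print(rr, (lo, hi))   # if 1.0 lies outside (lo, hi), infer p < 0.05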
Ex: The exposed children had a 3 times greater risk of a rash than did the unexposed kids
(RR = 3, 95% CI: 2 - 4.5). Fill in/select: In the (sample / population), exposed children
were 3 times as likely as unexposed to develop a rash. This value is consistent with (probably)
a true relative risk of between ________ and ________ for exposure. This (IS / IS NOT) good
statistical evidence for an increased risk, since ________ is not in the interval. [d]
Q: Intubated infants treated with phenobarbital were reported to have an odds ratio for
hemorrhage of 2.1, with 95% confidence interval 1.2 - 3.7. (Pediatrics, April 1986,
Kuban et al.) What does this tell you? [e]
Suppose it had said, OR = 2.1, 95% CI: 0.7 to 6.3. Would this still be statistically
significant evidence of increased risk? [f]
Ex: Davis et al, NEJM, V 339, 1998, 1493- report that 18% of hepatitis C patients treated
with interferon + ribavirin (an anti-viral agent) had detectable HCV levels at the end of
treatment, as compared to 53% of those treated with interferon alone. The relative risk
for detectable HCV is 0.34 (95% CI: 0.23 to 0.51) using interferon alone as the control.
Q: The relative risk is less than 1.0. This means: [g]
a. The researchers made a mistake.
b. There were very few cases of the disease.
c. The exposed (treated) group had a lower rate of bad outcomes than the
controls.
d. The exposure (treatment) has a beneficial or protective effect.
e. c and d.
Q: T/F: 95% of subjects in the study had a relative risk between 0.23 and 0.51. [h]
Q: T/F: 95% of subjects in the population have a relative risk between 0.23 and 0.51. [i]
QUIZ
1. A researcher reports (Arch Fam Med, May-Jun 1998) that "NSAID users have a 2.24
relative risk of developing symptomatic diverticular disease over 4 years compared to
non-users (95% CI: 1.28 to 3.91)."
The best interpretation is:
a. 95% of NSAID users in the sample have a risk between 1.28 and 3.91 times that in the
controls.
b. 95% of NSAID users in the population have a risk between 1.28 and 3.91 times that in
the controls.
c. There is a 95% probability that the sample relative risk for NSAID users is between
1.28 and 3.91.
d. There is a 95% probability that the population relative risk for NSAID users is between
1.28 and 3.91.
2. Pick all that apply for the data in #1. This 95% confidence interval:
a. provides good evidence for a true (population) relative risk > 1.
b. is consistent with the population relative risk being 2.
c. provides good evidence that the population relative risk is > 2.
d. is consistent with the population relative risk being 4.
e. provides good evidence that the population relative risk is < 4.
3. A group of subjects with a low left ventricular ejection fraction was studied. After they were
given placebo, their mean (± SD) exercise time was 300 ± 130 sec. After methoxamine it
was 600 ± 250 sec, p = 0.001. The best interpretation of this result is:
a. The change is too small to be significant at all.
b. The change is clinically but not statistically significant.
c. The change is quite unlikely to be due to chance alone.
d. Only 0.001 of the subjects failed to improve in exercise time.
4. When a result involving the comparison of two conditions is statistically significant, this
means:
a. There is definitely a true difference between the conditions.
b. There is strong statistical evidence for a difference between them.
c. The observed data in the two conditions differs by enough that it would be difficult for
chance alone to explain.
d. b and c
5. Which of the following p values is most convincing for concluding in favor of a real
difference?
a. 0.9
b. 0.10
c. 0.05
d. 0.01
Quiz Answers: 1) d   2) a, b, e   3) c   4) d   5) d

Answers to lettered questions in the text:
a) Sample. Population. 51% and 59%.
b) (c)
c) Sample. As are almost all research results.
d) Sample // 2 and 4.5 // Is, since 1.0 is not in the interval.
e) Phenobarbital increased the odds of hemorrhage by 2.1-fold in the sample. The population (or true) increase in odds could be 1.2- to 3.7-fold. Since 1.0 is not in the interval, the result is significant.
f) No. The confidence interval now includes 1.0.
g) (e)
h) F: Confidence intervals for relative risks don’t refer to individual subjects, in the sample or the population.
i) F: Confidence intervals for relative risks don’t refer to individual subjects, in the sample or the population.