FUNDAMENTALS OF MEDICAL RESEARCH
Ed Gracely, Ph.D.
Family, Community, and Preventive Medicine
Hypothesis testing and Confidence Intervals
July 15, 2014
Goals. Attendees should be able to:
1) Explain the basic logic of hypothesis testing.
2) Define and explain "p".
3) Define and distinguish type I and type II errors, and define alpha.
4) Define and distinguish statistical and clinical significance.
5) State and explain some concerns and issues with multiple comparisons and analyses.
6) Define, explain, and interpret confidence intervals.
7) Explain how confidence intervals are used to test statistical hypotheses.
A. Hypothesis Testing
Ex: [Graham DY et al, Annals of Internal Medicine. 116(9):705-8, 1992 May 1].
These authors assigned 89 patients with recently healed duodenal ulcers and proven H. pylori
infection to either an antacid alone or the same antacid plus antibiotics. Patients were
followed for up to 2 years. Recurrence of ulcer occurred in 95% of the antacid-alone patients
but in only 12% of the combined-therapy group.
If there were no benefit to treatment, the two samples would differ by this much or more
in fewer than one experiment out of 1,000 (by chance alone). You do not need to be able
to calculate this!
Ignoring bad design and fraud, there are 2 possible explanations:
1) Real difference between treatments.
2) This is one of those fewer-than-1-in-1,000 experiments with such a large
difference by chance.
Most people would pick "real difference"!
The probability of getting a difference so large, if there is no population (real)
difference, is the p-value so often reported in the literature.
Here, p < 1/1,000; that is, p < 0.001.
Statistically significant: said of a result for which you conclude in favor of a
difference because of a small p value.
In other words, if the addition of antibiotics actually has no special benefit, we would
expect to find no difference in the sample. We realize, however, that actual sample results
will randomly vary from this expectation. If many studies were done, all with two
equally-effective treatments, 1 study in 1,000 would have results as good as (or better
than) the example study purely by chance. So results like these COULD occur by chance,
but are very unlikely.
The cutoff for significance is symbolized α (alpha). Normally the cutoff for a significant
p is set to 0.05. It can be smaller than 0.05 if you have reason to be stricter (for example
if there are many comparisons in a study).
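To make the logic concrete, here is a minimal sketch (in Python, using scipy) of how a p-value for a 2 x 2 result like the ulcer example could be computed. The handout does not give the group sizes, so a roughly even split of the 89 patients is assumed purely for illustration:

    from scipy.stats import fisher_exact

    # Rows: recurrence yes / no. Group sizes (44 and 45) are assumed, not
    # taken from the paper; counts chosen to match ~95% and ~12% recurrence.
    table = [[42, 2],    # antacid alone: 42 of 44 recurred (~95%)
             [5, 40]]    # antacid + antibiotics: 5 of 45 recurred (~11%)
    odds_ratio, p_value = fisher_exact(table)
    print(p_value)       # well below 0.001, consistent with the quoted p < 0.001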
Other examples: Each contains a quote and an interpretation.
Ex 2: "The mean score on the Crohn's Disease activity index after 16 weeks of
treatment was significantly lower in the methotrexate group (162) than in the
placebo group (204), p = 0.002", Feagan, BG et al. NEJM, Feb 2, 1995
If the treatment really makes no difference, (that is, the population mean difference
= 0) then the probability of a mean difference of 42 or more (i.e., 204 - 162) in the
samples would be 0.002.
Ex 3: “OBJECTIVE: To evaluate the effectiveness of a health-visitor-led intervention
for failure to thrive in children under 2 years old…. When the children were last
weighed, 91 (76%) in the intervention group had recovered from their failure to
thrive compared with 60 (55%) in the control group (P<0.001).” BMJ.
317(7158):571-4, 1998 Aug 29
So, if the intervention truly made no difference, the probability of a difference this
great (76% versus 55%) or more in recovery rate by chance alone is less than
0.001.
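As a rough check on this p-value, the group sizes can be inferred from the reported counts and percentages (91/0.76 ≈ 120 and 60/0.55 ≈ 109); the sketch below treats those denominators as approximate:

    from scipy.stats import chi2_contingency

    # Rows: recovered / not recovered. Denominators (~120 and ~109) are
    # inferred from the reported counts and percentages, so approximate.
    table = [[91, 120 - 91],   # intervention group
             [60, 109 - 60]]   # control group
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(p_value)             # roughly 0.001, in line with "P < 0.001"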
Q: Which of examples 2 and 3 are statistically significant, using alpha = 0.05? How
about if the researcher had used alpha = 0.001?
Clinically significant: A result big enough to influence how you treat or advise
patients. Not all statistically significant results are clinically significant.
Type I and type II errors:
In clinical testing it is useful to know how likely it is that a test will produce a false positive
result (healthy patient indicated to have disease) and a false negative result (sick patient
not recognized as such by the test). We use similar concepts with statistical analyses.
Sometimes there truly is no benefit to a new treatment (or no difference between treatments)
but the data indicate that there is (with p < 0.05) entirely by chance. This is a kind of false
positive, and we call it a "Type I error". The probability of a type I error is the alpha value
we discussed above.
In other cases (perhaps more commonly) there really is a difference, but our study fails to
find it, again by chance. This is a type II error, a kind of false negative. Type II errors and
power (a related concept) are the subject of a later class in this series.
Ex: OK, so patient Jones has a fasting blood glucose of 130. Is Jones diabetic or could this
be a false positive? You can't tell from one test -- only a replication or additional data
(glucose tolerance test, say) will answer the question.
Similarly in statistics. You get p = 0.04. Is that a real difference or a type I error? Only a
replication or additional data will answer the question. We'll assume for now that it's real,
but replication and confirmation are the lifeblood of science.
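Type I error can also be seen directly by simulation: draw two samples from the same population many times and count how often p < 0.05 appears anyway. A minimal sketch (Python with numpy/scipy; all numbers invented):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    false_pos, runs = 0, 2000
    for _ in range(runs):
        a = rng.normal(0, 1, 30)
        b = rng.normal(0, 1, 30)   # same distribution: no true difference
        if stats.ttest_ind(a, b).pvalue < 0.05:
            false_pos += 1
    print(false_pos / runs)        # close to 0.05, the alpha level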
The main use for these two error terms is in critiquing studies.
Q: Fill in the blanks. From an actual review I did a year or two ago (modified to
conceal details): "… out of many analyses, you report and discuss one result with
p = 0.02 and a few others. This raises the issue of ________ error (false positive
error) unless you had anticipated this as the effect of most interest, which does not
appear to be the case. So you need to clearly consider chance as a possible
explanation for the findings."
Q: (Invented review): "You argue that your new treatment is just as good as the old
one, because you found no difference between them. But you only had 10 subjects
in each group. I think the possibility of ________ error must be dealt with."
B. Multiple analyses:
Type I error becomes a major issue when a study reports many analyses, especially if only a
few of them are significant.
Rule 1: In a study with numerous p-values, it is likely that an occasional one will be significant
by chance, even if there is no truly interesting difference or association involved. Be wary
of studies that seem to engage in "data fishing", i.e., working very hard to find a few
gems of statistical significance in a pile of non-significant findings. "Thus," says the
article, "we randomized cancer patients to treatment and control. Overall there were no
differences on rate of recurrence or death (p's > 0.05) between the two groups, but the
treatment did reduce mortality (p = 0.03 compared to control) in elderly patients with the
highest tumor grade. Future studies to investigate the benefits of this therapy in high risk
older patients may be warranted". Well, maybe. Or maybe not.
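The arithmetic behind Rule 1 is simple: with k independent tests at alpha = 0.05 and no true effects anywhere, the chance of at least one "significant" result is 1 - 0.95^k. A quick check:

    # Chance of at least one false positive among k independent tests at 0.05
    for k in (1, 5, 10, 20):
        print(k, round(1 - 0.95 ** k, 2))   # 0.05, 0.23, 0.40, 0.64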
Rule 2: The concern about type I error is greatly reduced if
• many of the results are significant (since it is quite unlikely they are all type I
errors), or
• the few significant results have strong p-values, like 0.01 or smaller, or
• a few critical comparisons had been indicated as primary in advance, and these are
significant. It is good research methodology to indicate your primary hypotheses
and analyses in advance.
Far too many studies have non-replicable results, sometimes because of playing with the
data. Non-significant analyses are not reported. Only the most interesting comparisons
are included. Composite outcome variables for comparison are crafted after the data is
reviewed. All of this is very bad practice. Research should be "scripted" in advance.
How are you going to analyze the data? What will be reported? Etc…
C. Confidence Intervals
Confidence intervals: intervals computed around a sample statistic that are intended
to be highly likely to contain the true, or "population", value of that statistic.
Commonly 95% confidence intervals are used. What we did with the SEM is
a rough confidence interval.
Remember: a sample statistic is just an estimate for the true value. Thus a sample mean
is not necessarily the mean in the population of patients like those seen. The sample
mean estimates that population mean, but will differ from it randomly even in a well-done study.
The same is true for other parameters, like relative risks and percentages.
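To make this concrete, here is a minimal sketch of a 95% confidence interval for a mean, built from the sample mean and its standard error (the data values are invented, not from the studies below):

    import numpy as np
    from scipy import stats

    # Invented illustration data; not from any study cited in this handout.
    x = np.array([55.2, 61.0, 58.4, 63.1, 60.0, 66.3])
    m, sem = x.mean(), stats.sem(x)
    lo, hi = stats.t.interval(0.95, len(x) - 1, loc=m, scale=sem)  # df = n - 1
    print(m, (lo, hi))   # sample mean with its 95% confidence interval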
Ex: Free walking speed in a sample of patients with a leg amputation was found to have a
mean of 60.6 m/min (95% CI: 52.6 - 68.6). [Powers et al, PT April 1996]. This indicates
that the sample mean FWS was 60.6, and that the researcher is 95% sure that the true
(population) mean FWS for similar patients is between 52.6 and 68.6.
Ex: Hogben et al, Obstetrics and Gynecology Oct 2002 investigated in a survey the kinds
of screening different types of doctors did for STIs. One result was that among the 647
responding Ob/Gyns, 55% screened non-pregnant women for Chlamydia (95%
confidence interval 51% to 59%).
Q: The statistic of interest here is not a mean; it is the percentage of Ob/Gyns who
screen. 55% screened non-pregnant women for Chlamydia. This is a (sample //
population) statistic. Based on the 95% confidence interval, we are 95% sure that
the (sample // population) screening rate by Ob/Gyns for Chlamydia is between
________ and ________. [a]
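You can reproduce the Hogben interval with the usual normal approximation for a proportion, p ± 1.96 × √(p(1 − p)/n):

    import math

    p, n = 0.55, 647                      # screening rate and number of responders
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    print(p - half, p + half)             # ~0.512 to 0.588: the reported 51% to 59%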
INTERPRETING CI FOR RELATIVE RISKS AND ODDS RATIOS
Relative risks and odds ratios are often presented with 95% CI as a critical part of their
interpretation.
First a few questions to be sure you are with me on relative risks:
Q: A researcher might discover that the relative risk for poison ivy was 25 in people who
walk through the woods a lot compared to controls who rarely go there. This means
that: [b]
a. 25% of people who walk through the woods get poison ivy.
b. There were 25 people in the study who walked in the woods.
c. Poison ivy was 25 times as frequent in people who walked in the woods as in
controls.
d. Picking up those “shiny, pretty leaves” is fun and useful (Hint: wrong answer).
Q: Based on a single study, the relative risk of 25 is best regarded as a (SAMPLE //
POPULATION) value. [c]
Confidence intervals for many statistics are interpreted in one of two ways:
• In many cases, researchers are interested in the largest or smallest plausible
value for the population statistic. A confidence interval can help determine
whether certain values or ranges of values can be "ruled out" or "ruled in" with
high probability.
• Often the most important question is whether the value that would represent
no effect or no association can be ruled out. If it can be, then the confidence
interval indicates statistical significance even without a p-value.
The "no difference" value for a relative risk or odds ratio is 1.0. If 1 is not in the
95% confidence interval, you may infer p < 0.05 and find for a difference.
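For those curious how such intervals arise, a 95% CI for a relative risk is typically computed on the log scale. The sketch below uses invented counts, not data from any study cited here:

    import math

    # Invented 2x2 counts: events / totals in exposed and unexposed groups.
    a, n1 = 30, 100   # exposed: 30 events in 100 (hypothetical)
    b, n2 = 10, 100   # unexposed: 10 events in 100 (hypothetical)
    rr = (a / n1) / (b / n2)
    se = math.sqrt(1/a - 1/n1 + 1/b - 1/n2)           # SE of log(RR)
    lo = math.exp(math.log(rr) - 1.96 * se)
    hi = math.exp(math.log(rr) + 1.96 * se)
    print(rr, (lo, hi))   # if 1.0 lies outside (lo, hi), infer p < 0.05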
Ex: The exposed children had a 3 times greater risk of a rash than did the unexposed kids
(RR = 3, 95% CI: 2 - 4.5). Fill in/select: In the (sample / population), exposed children
were 3 times as likely as unexposed to develop a rash. This value is consistent with (probably)
a true relative risk of between ________ and ________ for exposure. This (IS / IS NOT) good
statistical evidence for an increased risk, since ________ is not in the interval. [d]
Q: Intubated infants treated with phenobarbital were reported to have an odds ratio for
hemorrhage of 2.1, with 95% confidence interval 1.2 - 3.7. (Pediatrics, April 1986,
Kuban et al.) What does this tell you? [e]
Suppose it had said, OR = 2.1, 95% CI: 0.7 to 6.3. Would this still be statistically
significant evidence of increased risk? [f]
Ex: Davis et al, NEJM, V 339, 1998, 1493- report that 18% of hepatitis C patients treated
with interferon + ribavirin (an anti-viral agent) had detectable HCV levels at the end of
treatment, as compared to 53% of those treated with interferon alone. The relative risk
for detectable HCV is 0.34 (95% CI: 0.23 to 0.51) using interferon alone as the control.
Q: The relative risk is less than 1.0. This means: [g]
a. The researchers made a mistake.
b. There were very few cases of the disease.
c. The exposed (treated) group had a lower rate of bad outcomes than the
controls.
d. The exposure (treatment) has a beneficial or protective effect.
e. c and d.
Q: T/F: 95% of subjects in the study had a relative risk between 0.23 and 0.51. [h]
Q: T/F: 95% of subjects in the population have a relative risk between 0.23 and 0.51. [i]
QUIZ
1. A researcher reports (Arch Fam Med, May-Jun 1998) that "NSAID users have a 2.24
relative risk of developing symptomatic diverticular disease over 4 years compared to
non-users (95% CI: 1.28 to 3.91)."
The best interpretation is:
a. 95% of NSAID users in the sample have a risk between 1.28 and 3.91 times that in the
controls.
b. 95% of NSAID users in the population have a risk between 1.28 and 3.91 times that in
the controls.
c. There is a 95% probability that the sample relative risk for NSAID users is between
1.28 and 3.91.
d. There is a 95% probability that the population relative risk for NSAID users is between
1.28 and 3.91.
2. Pick all that apply for the data in #1. This 95% confidence interval:
a. provides good evidence for a true (population) relative risk > 1.
b. is consistent with the population relative risk being 2.
c. provides good evidence that the population relative risk is > 2.
d. is consistent with the population relative risk being 4.
e. provides good evidence that the population relative risk is < 4.
3. A group of subjects with a low left ventricular ejection fraction was studied. After they were
given placebo, their mean (± SD) exercise time was 300 ± 130 sec. After methoxamine it
was 600 ± 250 sec, p = 0.001. The best interpretation of this result is:
a. The change is too small to be significant at all.
b. The change is clinically but not statistically significant.
c. The change is quite unlikely to be due to chance alone.
d. Only 0.001 of the subjects failed to improve in exercise time.
4. When a result involving the comparison of two conditions is statistically significant, this
means:
a. There is definitely a true difference between the conditions.
b. There is strong statistical evidence for a difference between them.
c. The observed data in the two conditions differs by enough that it would be difficult for
chance alone to explain.
d. b and c
5. Which of the following p values is most convincing for concluding in favor of a real
difference?
a. 0.9
b. 0.10
c. 0.05
d. 0.01
Quiz Answers: 1) d   2) a, b, e   3) c   4) d   5) d

Answers to lettered questions in the text:
a) Sample. Population. 51% and 59%.
b) (c)
c) Sample. As are almost all research results.
d) Sample // 2 and 4.5 // Is, since 1.0 is not in the interval.
e) Phenobarbital increased the odds of hemorrhage by 2.1-fold in the sample. The population (or true) increase in odds could be 1.2- to 3.7-fold. Since 1.0 is not in the interval, the result is significant.
f) No. The confidence interval now includes 1.0.
g) (e)
h) F: Confidence intervals for relative risks don’t refer to individual subjects, in the sample or the population.
i) F: Confidence intervals for relative risks don’t refer to individual subjects, in the sample or the population.