EPI-820 Evidence-Based Medicine
LECTURE 7: CLINICAL STATISTICAL INFERENCE
Mat Reeves BVSc, PhD

Objectives
• Understand the theoretical underpinnings and the flaws associated with the current approach to clinical statistical testing (the frequentist approach).
• Understand the difference between testing and estimation.
• Understand the advantages of the CI and the CI functions.
• Understand the logic of a Bayesian approach.

Personal Statistical History
• Post-DVM:
  - Clueless. Sceptical of the role of statistics
  - Thinks research = the search for P < 0.05
• PhD Era:
  - Increasing obsession with stat methods
  - Lots of tools! SLR, ANOVA, MLR, LR, LL & Cox
  - Thinks statistics = “real science”
• Post-PhD:
  - Healthy scepticism for the way stats are used
  - Stats = methods which have inherent limitations
  - Not a substitute for clear scientific thought or understanding the “scientific method”

Review of Significance Tests
Substantive hypothesis: Cows on BST will tend to gain weight
Null hypothesis (Ho): the mean body wt. of cows trt with BST is not different from the mean body wt. of control cows
  Ux = Uy
Alternative hypothesis (Ha): the mean body wt. of cows trt with BST is different from the mean body wt. of control cows
  Ux ≠ Uy

Review of Significance Tests
- Logically, if Ho is refuted Ha is confirmed
- investigator seeks to 'nullify' Ho
Expt: 20 cows randomized to BST (X) and control (Y). Measure wt. gain. Calculate mean wt. change per group.

Review of Significance Tests
Assumptions:
i) Sample statistic (X - Y) is one instance of an infinitely large number of sample statistics obtained from an infinite number of replications of the expt., under the same conditions (frequentist assumption)
ii) Populations are normally distributed, equal variance
iii) The Ho is true

Review of Significance Tests (t-test)
$$ t = \frac{\bar{X} - \bar{Y}}{S_{\bar{x}-\bar{y}}} \sim t_{df} \ (\approx N(0, 1) \text{ for large } df), \qquad df = (n_1 - 1) + (n_2 - 1) $$
Where:
$$ S_{\bar{x}-\bar{y}} = \sqrt{S^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} $$
= standard error of the difference between two independent means.
$S^2$ = estimate of pooled population variance
- t may take on any value; no value is logically inconsistent with Ho! Smaller t values are more consistent with Ho being true.
- all else equal, larger n's increase the value of t (higher power).

Review of Significance Tests
Large values of t indicate either:
i) test assumptions are true, and a rare event has occurred, or
ii) one of the assumptions of the test is false, and by convention it is assumed that the Ho is not true.
- By convention, the relative frequency of t at which we choose (ii) above as the logical conclusion is set to 5% (alpha level or significance level)
- Expt: t = 2.55, p = 0.02, reject Ho - result is significant

Review of Significance Tests
- Type I error (alpha), occurs 5% of the time when Ho is true
- Type II error (beta), occurs B% of the time when Ho is false
- Alpha and beta are inversely related
- Fixing alpha at 5% means Sp is 95%
- Beta is not set 'a priori', hence Se (power) tends to be low
- Scientific caution dictates that we set alpha small
- Scientific ignorance dictates we ignore beta!

Alpha and beta are inversely related

Relationship between diagnostic test result and disease status

                        DISEASE
                 PRESENT (D+)    ABSENT (D-)
TEST
POSITIVE (T+)    TP (a)          FP (b)          PVP = a/(a + b)
NEGATIVE (T-)    FN (c)          TN (d)          PVN = d/(c + d)

                 Se = a/(a + c)  Sp = d/(b + d)
                 Se = P(T+|D+)   Sp = P(T-|D-)

Relationship between significance test results and truth

                        TRUTH
                 Ho False           Ho True
SIGNF. TEST
REJECT Ho        TP (1 - B)         FP, Type I (a)     PVP = TP/(TP + FP)
ACCEPT Ho        FN, Type II (B)    TN (1 - a)         PVN = TN/(TN + FN)

Se = TP/(TP + FN) = Power (1 - B)
Sp = TN/(TN + FP)

Power
- Probability of rejecting Ho when Ho is false
- Se = TP/(TP + FN) or (1 - B)
- Power is a function of:
i) Alpha (increase by making Ha one sided, i.e., Ux > Uy) (consistent with changing the cut-off value)
ii) Reliability (as measured by SE of the difference)
  - Power increases with decreasing SE
  - SE decreases with increasing sample size (= decr variance)
iii) Size of treatment effect

The Consequences of Low Power
i) difficult to interpret negative results
  - truly no effect, or
  - expt unable to detect true difference
ii) increased proportion of Type I errors in the literature
iii) fail to identify many important associations
iv) low power means low precision (indicated by the confidence interval)

Questions?
• What proportion of statistically significant findings published in the literature are false positive (Type I) errors?
• What well-known measure is this proportion? And what elements does this figure therefore depend on?

Hypothetical outcomes of 500 experiments, a = 0.05, Power = 0.50, and 20% prevalence of false Ho's

                        TRUTH
                 Ho FALSE    Ho TRUE
SIGNF. TEST
REJECT Ho        50          20
ACCEPT Ho        50          380
Total            100         400         N = 500

Se = 50%   Sp = 95%   PV+ = 50/70 = 71%
If all signf. results are published, 29% are Type I errors

The P value
- probability of obtaining a value of the test statistic (X) at least as large as the one observed, given the Ho is true
- P(>= X | Ho true)
Common Incorrect Interpretations
- It is NOT P(Ho true | Data)!!!
- We can never state the probability of a hypothesis being true! (under the frequentist approach)
- It is NOT the probability that the results were due to chance!
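
The arithmetic behind the hypothetical 500-experiment table can be checked with a short Python sketch. The function name `predictive_values` and its dictionary keys are illustrative choices, not part of the lecture; the inputs (500 experiments, alpha = 0.05, power = 0.50, 20% of null hypotheses false) are the ones given on the slide.

```python
def predictive_values(n_experiments, alpha, power, prevalence_false_h0):
    """Cross-classify experiments by truth (Ho false / Ho true) and test result.

    Hypothetical helper: treats a significance test like a diagnostic test,
    with Se = power and Sp = 1 - alpha.
    """
    n_h0_false = n_experiments * prevalence_false_h0  # real effects exist
    n_h0_true = n_experiments - n_h0_false            # no real effect

    tp = power * n_h0_false   # Ho false and rejected (true positives)
    fn = n_h0_false - tp      # Ho false but accepted (Type II errors)
    fp = alpha * n_h0_true    # Ho true but rejected (Type I errors)
    tn = n_h0_true - fp       # Ho true and accepted (true negatives)

    return {
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        "ppv": tp / (tp + fp),  # PV+: share of significant results that are real
        "npv": tn / (tn + fn),  # PV-: share of non-significant results that are truly null
    }


result = predictive_values(500, alpha=0.05, power=0.50, prevalence_false_h0=0.20)
print(f"PV+ = {result['ppv']:.0%}")                             # PV+ = 71%
print(f"Type I share of signf. results = {1 - result['ppv']:.0%}")  # 29%
```

Changing `power` or `prevalence_false_h0` shows how the proportion of false positives among significant results depends on both, which is the point of the Questions slide.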

Criticisms of Significance Tests
i) Decision vs Inference (Neyman-Pearson)
- pioneers of modern statistics were interested in producing results that enabled decisions to be made
- problem of automatic acceptance or rejection based on an arbitrary cutoff (P = 0.04 vs P = 0.06)
- results should adjust your degree of belief in a hypothesis rather than forcing you to accept an artificial dichotomy
- "intellectual economy"

Criticisms of Significance Tests
ii) Asymmetry of significance tests
- frequently, the experimental data can be found to be consistent with a Ho of no effect or a Ho of a 20% increase
- acceptance of both Ho's given the data leads to 2 very different conclusions!
- asymmetry was recognized by Fisher, hence convention is to identify theory with the Ha but to test the Ho
- "Is there an effect?" is the wrong question! Should ask: "What is the size of the effect?"

Criticisms of Significance Tests
iii) Corroborative power of significance tests
- Both Fisherian and Neyman-Pearson schools make no assumption about the prior probability of Ho
- Both schools presume Ho is almost always false
- rejection of Ho does nothing to illuminate which of the vast number of Ha's are supported by the data!
- Failing to reject Ho does not prove Ho is true (Popper: 'we can falsify hypotheses but not confirm them')

Criticisms of Significance Tests
iv) Effect size and significance tests
- Test statistics and p values are a function of both effect size and sample size
- Cannot infer the size of an effect by inspection of the P value
- reporting P < 0.00001 has no scientific merit!
- Highly significant results may be derived from trivial effects if sample size is large.
- Confidence intervals give a plausible range for the unknown popl parameter (signf tests show what the parameter is not!)

Relationship between the Size of the Sample and the Size of the P Value
• Example RCT:
• Intervention: new a/b for pneumonia.
• Outcome: Recovery Rate = % of patients in clinical recovery by 5 days
• Facts:
  • Known = Existing drug of choice results in 35% recovery rate at 5 days
  • Unknown = New drug improves recovery rate by 5% (to 40%)

P values Generated by RCT by Sample Size

Sample Size (N = 2x)    P value (Chi-square)
100                     0.465
500                     0.103
600                     0.074
700                     0.053
800                     0.039
1000                    0.021

Conclusion?
Significance testing should be abandoned and replaced with interval estimation (point estimate and CI)!
Why? Interval estimation:
- is not couched in pseudo-scientific hypothesis testing language
- does not carry any decision-making implications
- gives a plausible range for the unknown popl parameter
- gives a clue as to sample size (width of the CI)
- avoids the danger of inferring a large effect when the result is highly significant

Interval estimation
- view "experimentation" as a measurement exercise
- want an unbiased, precise measure of effect
- Point estimate: best estimate of the true effect, given the data (aka MLE); it indicates the magnitude of effect (but is imprecise)
- Confidence intervals indicate the degree of precision of the estimate. They represent the set of all possible values for the parameter that are consistent with the data
- width of CI depends on variability and level of confidence (%)

Interval estimation
- 90% CI:
  - 90% of such intervals will include the true unknown popl. parameter (necessary frequentist interpretation)
  - it does not represent a 90% probability of including the true unknown popl. parameter within it
- CIs indicate magnitude and precision.
- CIs are linked to alpha and hypothesis testing: confidence level = (1 - alpha), e.g. 95% for alpha = 0.05

Interval estimation - Example

               OUTCOME
           +      -      Total
TRT A      7      13     20       P(success) = 35%
TRT B      14     6      20       P(success) = 70%

Significance test: P = 0.06 or NS!
Interval estimation of difference: 35% (95% CI = -1%, +71%)

Confidence Intervals
- CIs are non-uniform: the true parameter is more likely to be located centrally than near the limits. Therefore the precise location of the boundary is irrelevant!
- For a study to be reassuring about a lack of effect, boundaries of CI should be near the null value
- CIs have clear advantages over the p-value but still suffer from the necessary frequentist interpretation (a CI represents one member of a family of CIs produced by an infinite number of replications of the same experiment)
- CI functions

Which is the more important study?
[Figure: Study A and Study B; axis labels: null point, larger effect]

Importance of Beta (Type II error) and Sample Size in RCTs (Freiman et al 1978)
• Reviewed 71 “negative” (P > 0.05) RCTs published from 1960-77
• Assume 25% treatment effect:
  • 94% (N = 67) of trials had < 90% power
  • Only 15% (N = 10) had sufficient evidence to conclude no effect
• Assume 50% treatment effect:
  • 70% (N = 50) of trials had < 90% power
  • Only 32% (N = 16) had sufficient evidence to conclude no effect

The P Value Fallacy - Goodman
• Derives from the simultaneous application of the p-value as:
  • A long-run, error-based, deductive tool (Neyman-Pearson frequentist application), and
  • A short-run, evidential and inductive tool (i.e., what is the meaning of this particular result?)
• The p-value was never designed to serve these two conflicting roles

The Bayes Factor - Goodman
• Comparison of how well two hypotheses predict the data:
$$ \text{Bayes factor} = \frac{P(\text{Data} \mid \text{Ho})}{P(\text{Data} \mid \text{Ha})} $$
• Allows explicitly the incorporation of external evidence (in terms of prior probability/belief)
• Use of Bayesian statistics shows that the weight of evidence against the Ho is not as strong as the p-value suggests (Table 2)
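
To make the last point concrete, here is a minimal Python sketch of how a Bayes factor updates a prior belief in Ho. It rests on an assumption the slide does not state: it uses the minimum Bayes factor under a Gaussian approximation, exp(-z²/2), where z is the normal deviate matching a two-sided p-value. The function names `min_bayes_factor` and `posterior_prob_h0` and the 50% prior are illustrative only.

```python
import math
from statistics import NormalDist


def min_bayes_factor(p_two_sided):
    """Smallest Bayes factor (strongest evidence against Ho) for a p-value.

    Assumption: Gaussian approximation, minimum BF = exp(-z^2 / 2),
    with z the standard normal deviate for the two-sided p-value.
    """
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    return math.exp(-z * z / 2)


def posterior_prob_h0(prior_prob_h0, bayes_factor):
    """Bayes' theorem in odds form: posterior odds = prior odds x Bayes factor."""
    prior_odds = prior_prob_h0 / (1 - prior_prob_h0)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)


# Illustrative prior: Ho and Ha judged equally likely before the study.
for p in (0.05, 0.01, 0.001):
    bf = min_bayes_factor(p)
    post = posterior_prob_h0(0.5, bf)
    print(f"P = {p}: minimum Bayes factor = {bf:.3f}, P(Ho | data) >= {post:.3f}")
```

Under these assumptions, P = 0.05 corresponds to a minimum Bayes factor of roughly 0.15, so with a 50% prior the probability of Ho remains above 10%, noticeably weaker evidence than "P = 0.05" is often taken to imply.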