Chapter 6
Introduction to Statistical Inference

Introduction
• Goal: Make statements regarding a population (or state of nature) based on a sample of measurements
• Probability statements used to substantiate claims
• Example: Clinical Trial for Pravachol (5-year follow-up)
– Of 3302 subjects receiving Pravachol, 174 had heart incidences
– Of 3293 subjects receiving placebo, 248 had heart incidences

$\hat{p}_{Pravachol} = \frac{174}{3302} = .0527$ (5.27%)
$\hat{p}_{placebo} = \frac{248}{3293} = .0753$ (7.53%)

Probability that Pravachol would do this much better if not effective: .000088
Approximately one chance in 11363

Estimating with Confidence
• Goal: Estimate a population mean (proportion) based on sample mean (proportion)
• Unknown: Parameter ($\mu$, $p$)
• Known: Approximate Sampling Distribution of Statistic

$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \qquad \hat{p} \sim N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$

• Recall: For a random variable that is normally distributed, the probability that it will fall within 2 standard deviations of the mean is approximately 0.95

$P\left(\mu - \frac{2\sigma}{\sqrt{n}} \le \bar{X} \le \mu + \frac{2\sigma}{\sqrt{n}}\right) \approx 0.95$
$P\left(p - 2\sqrt{\frac{p(1-p)}{n}} \le \hat{p} \le p + 2\sqrt{\frac{p(1-p)}{n}}\right) \approx 0.95$

Estimating with Confidence
• Although the parameter is unknown, it's highly likely that our sample mean or proportion (estimate) will lie within 2 standard deviations (aka standard errors) of the population mean or proportion (parameter)
• Margin of Error: Measure of the upper bound in sampling error with a fixed level (we will use 95%) of confidence.
That will correspond to 2 standard errors:

Mean: Margin of Error (95% Confidence): $\frac{2\sigma}{\sqrt{n}}$
Proportion: Margin of Error (95% Confidence): $2\sqrt{\frac{p(1-p)}{n}}$
Confidence Interval: estimate $\pm$ margin of error

Confidence Interval for a Mean $\mu$
• Confidence Coefficient (C): Probability (based on repeated samples and construction of intervals) that a confidence interval will contain the true mean $\mu$
• Common choices of C and resulting intervals:

90% Confidence: $\bar{x} \pm 1.645\frac{\sigma}{\sqrt{n}}$
95% Confidence: $\bar{x} \pm 1.960\frac{\sigma}{\sqrt{n}}$
99% Confidence: $\bar{x} \pm 2.576\frac{\sigma}{\sqrt{n}}$
C% Confidence: $\bar{x} \pm z^*\frac{\sigma}{\sqrt{n}}$

C | 90% | 95% | 99%
z* | 1.645 | 1.960 | 2.576

[Figure: Normal distribution of $\bar{X}$ with central area C between $\mu - z^*\frac{\sigma}{\sqrt{n}}$ and $\mu + z^*\frac{\sigma}{\sqrt{n}}$, and area $\frac{1-C}{2}$ in each tail]
[Figure: Standard normal distribution with central area C between $-z^*$ and $z^*$, and area $\frac{1-C}{2}$ in each tail]

Philadelphia Monthly Rainfall (1825-1869)
[Figure: Histogram of monthly rainfall (vertical axis: Frequency)]

$\mu = 3.68 \qquad \sigma = 1.92$
Margin of error (n = 20, C = 95%): $1.96\left(\frac{1.92}{\sqrt{20}}\right) = 0.84$

4 Random Samples of Size n=20, 95% CI's

Sample 1 Month | Rain | Ran# | Sample 2 Month | Rain | Ran# | Sample 3 Month | Rain | Ran# | Sample 4 Month | Rain | Ran#
156 | 2.56 | 0.0028 | 349 | 2.33 | 0.0007 | 185 | 2.69 | 0.0005 | 171 | 1.50 | 0.0011
51 | 2.87 | 0.0050 | 149 | 4.86 | 0.0013 | 527 | 5.28 | 0.0029 | 175 | 2.52 | 0.0048
176 | 4.64 | 0.0052 | 227 | 4.15 | 0.0054 | 114 | 3.99 | 0.0048 | 130 | 1.22 | 0.0085
364 | 2.05 | 0.0082 | 336 | 5.17 | 0.0073 | 312 | 4.51 | 0.0084 | 167 | 3.35 | 0.0094
271 | 2.76 | 0.0142 | 124 | 4.33 | 0.0081 | 49 | 5.37 | 0.0085 | 101 | 5.88 | 0.0133
7 | 2.06 | 0.0145 | 330 | 4.03 | 0.0101 | 398 | 2.29 | 0.0166 | 33 | 0.79 | 0.0148
312 | 4.51 | 0.0153 | 468 | 4.63 | 0.0132 | 396 | 5.55 | 0.0187 | 299 | 2.60 | 0.0164
219 | 4.41 | 0.0160 | 293 | 3.99 | 0.0145 | 99 | 2.22 | 0.0233 | 337 | 1.85 | 0.0191
16 | 3.87 | 0.0171 | 511 | 2.39 | 0.0149 | 181 | 1.84 | 0.0235 | 447 | 3.55 | 0.0193
484 | 2.83 | 0.0190 | 235 | 5.28 | 0.0172 | 364 | 2.05 | 0.0244 | 78 | 3.53 | 0.0213
316 | 4.56 | 0.0202 | 314 | 3.11 | 0.0190 | 392 | 7.59 | 0.0253 | 117 | 3.57 | 0.0224
318 | 3.44 | 0.0257 | 372 | 5.42 | 0.0260 | 477 | 7.16 | 0.0283 | 399 | 1.09 | 0.0227
517 | 3.62 | 0.0272 | 164 | 2.78 | 0.0272 | 434 | 2.07 | 0.0290 | 52 | 4.99 | 0.0240
249 | 2.16 | 0.0301 | 48 | 0.26 | 0.0281 | 229 | 4.05 | 0.0318 | 162 | 6.60 | 0.0261
445 | 4.79 | 0.0320 | 236 | 2.40 | 0.0284 | 223 | 4.54 | 0.0320 | 95 | 2.59 | 0.0296
13 | 1.11 | 0.0324 | 50 | 3.75 | 0.0319 | 279 | 2.76 | 0.0364 | 479 | 3.93 | 0.0296
479 | 3.93 | 0.0325 | 39 | 3.35 | 0.0325 | 520 | 5.44 | 0.0374 | 51 | 2.87 | 0.0303
370 | 4.11 | 0.0345 | 417 | 7.68 | 0.0333 | 245 | 1.60 | 0.0374 | 380 | 6.00 | 0.0311
348 | 2.17 | 0.0374 | 503 | 1.76 | 0.0359 | 183 | 2.63 | 0.0391 | 61 | 1.63 | 0.0324
89 | 5.40 | 0.0380 | 151 | 5.89 | 0.0361 | 41 | 3.49 | 0.0395 | 302 | 2.87 | 0.0339

Sample | Mean | Mean-me | Mean+me
1 | 3.39 | 2.55 | 4.23
2 | 3.88 | 3.04 | 4.72
3 | 3.86 | 3.02 | 4.70
4 | 3.15 | 2.31 | 3.99

Factors Affecting Confidence Interval Width
• Goal: Have precise (narrow) confidence intervals
– Confidence Level (C): Increasing C increases the probability that an interval contains the parameter, which implies a wider confidence interval. Reducing C will shorten the interval (at a cost in confidence)
– Sample size (n): Increasing n decreases the standard error of the estimate, the margin of error, and the width of the interval (quadrupling n cuts the width in half)
– Standard Deviation ($\sigma$): The more variable the individual measurements, the wider the interval. Potential ways to reduce $\sigma$ are to focus on a more precise target population or use a more precise measuring instrument.
Often nothing can be done, as nature determines $\sigma$

Selecting the Sample Size
• Before collecting sample data, we usually have a goal for how large the margin of error should be to have a useful estimate of the unknown parameter (particularly when comparing two populations)
• Let m be the desired level of the margin of error and $\sigma$ be the standard deviation of the population of measurements (typically will be unknown and must be estimated based on previous research or a pilot study)
• The sample size giving this margin of error is:

$m = \frac{z^*\sigma}{\sqrt{n}} \;\Rightarrow\; n = \left(\frac{z^*\sigma}{m}\right)^2$

Precautions
• Data should be a simple random sample from the population (or at least can be treated as independent observations)
• More complex sampling designs have adjustments made to formulas (see texts such as Elementary Survey Sampling by Scheaffer, Mendenhall, Ott)
• Biased sampling designs give meaningless results
• Small sample sizes from nonnormal distributions will have coverage probabilities (C) typically below the nominal level
• Typically $\sigma$ is unknown. Replacing it with the sample standard deviation s works as a good approximation in large samples

Significance Tests
• Method of using sample (observed) data to challenge a hypothesis regarding a state of nature (represented as particular parameter value(s))
• Begin by stating a research hypothesis that challenges a statement of “status quo” (or equality of 2 populations)
• State the current state or “status quo” as a statement regarding population parameter(s)
• Obtain sample data and see to what extent it agrees/disagrees with the “status quo”
• Conclude that the “status quo” is not true if the observed data would be highly unlikely (low probability) if it were true

Pravachol and Olestra
• Pravachol vs Placebo wrt heart disease/death
– Pravachol: 5.27% of 3302 patients suffer MI or death due to CHD
– Placebo: 7.53% of 3293 patients suffer MI or death due to CHD
– Probability of difference this large for Pravachol if no more effective than placebo is .000088 (will learn formula later)
• Olestra vs Triglyceride Chips wrt GI Symptoms
– Olestra: 15.81% of 563 subjects report GI symptoms
– Triglyceride: 17.58% of 529 subjects report GI symptoms
– Probability of difference this large in either direction (olestra better or worse) is .4354
• Strong evidence of Pravachol effect vs placebo
• Weak to no evidence of Olestra effect vs Triglyceride

Elements of a Significance Test
• Null hypothesis (H0): Statement or theory being tested. Will be stated in terms of parameters and contain an equality. Test is set up under the assumption of its truth.
• Alternative Hypothesis (Ha): Statement contradicting H0. Will be stated in terms of parameters and contain an inequality. Will only be accepted if strong evidence refutes H0 based on sample data. May be 1-sided or 2-sided, depending on theory being tested.
• Test Statistic (TS): Quantity measuring discrepancy between sample statistic (estimate) and parameter value under H0
• P-value: Probability (assuming H0 true) that we would observe sample data (test statistic) this extreme or more extreme in favor of the alternative hypothesis (Ha)

Example: Interference Effect
• Does the way items are presented affect task time?
– Subjects shown list of color names in 2 colors: different/black
– Xi is the difference in times to read lists for subject i: diff-blk
– H0: No interference effect: mean difference is 0 ($\mu = 0$)
– Ha: Interference effect exists: mean difference > 0 ($\mu > 0$)
– Assume standard deviation in differences is $\sigma = 8$ (unrealistic*)
– Experiment to be based on n=70 subjects

Parameter value under H0: $\mu = 0$
Approximate distribution of sample mean under H0: $\bar{X} \sim N\left(0, \frac{\sigma}{\sqrt{n}} = \frac{8}{\sqrt{70}} = 0.96\right)$
Observed sample mean: $\bar{x} = 2.39$

How likely to observe sample mean difference $\ge 2.39$ if $\mu = 0$?

[Figure: Sampling distribution of X-bar under H0, centered at 0, with the P-value shown as the area above 2.39]

Computing the P-Value
• 2-sided Tests: How likely is it to observe a sample mean as far or farther from the value of the parameter under the null hypothesis?
(H0: $\mu = \mu_0$, Ha: $\mu \ne \mu_0$)

Under H0: $\bar{X} \sim N\left(\mu_0, \frac{\sigma}{\sqrt{n}}\right) \;\Rightarrow\; Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1)$

After obtaining the sample data, compute the mean, convert it to a z-score (zobs), and find the area above |zobs| and below -|zobs| from the standard normal (z) table
• 1-sided Tests: Obtain the area above zobs for upper tail tests (Ha: $\mu > \mu_0$) or below zobs for lower tail tests (Ha: $\mu < \mu_0$)

Interference Effect (1-sided Test)
• Testing whether population mean time to read list of colors is higher when color is written in different color
• Data: Xi: difference score for subject i (Different-Black)
• Null hypothesis (H0): No interference effect ($\mu = 0$)
• Alternative hypothesis (Ha): Interference effect ($\mu > 0$)
• “Known”: n=70, $\sigma = 8$ (This won’t be known in practice but can be replaced by sample s.d. for large samples)

Sample Data: $\bar{x} = 2.39 \quad s = 7.81 \quad n = 70$
Test Statistic (Based on $\sigma = 8$): $z_{obs} = \frac{2.39 - 0}{8/\sqrt{70}} = \frac{2.39}{0.96} = 2.49$
Test Statistic (Based on $s = 7.81$): $z_{obs} = \frac{2.39 - 0}{7.81/\sqrt{70}} = \frac{2.39}{0.93} = 2.57$
P-value (Based on $\sigma = 8$): $P(Z \ge 2.49) = 1 - .9936 = .0064$
P-value (Based on $s = 7.81$): $P(Z \ge 2.57) = 1 - .9949 = .0051$

Interference Effect (2-sided Test)
• Testing whether population mean time to read list of colors is affected (higher or lower) when color is written in different color
• Data: Xi: difference score for subject i (Different-Black)
• Null hypothesis (H0): No interference effect ($\mu = 0$)
• Alternative hypothesis (Ha): Interference effect (+ or -) ($\mu \ne 0$)
• “Known”: n=70, $\sigma = 8$ (This won’t be known in practice but can be replaced by sample s.d. for large samples)

Sample Data: $\bar{x} = 2.39 \quad s = 7.81 \quad n = 70$
Test Statistic (Based on $\sigma = 8$): $z_{obs} = \frac{2.39 - 0}{8/\sqrt{70}} = \frac{2.39}{0.96} = 2.49$
Test Statistic (Based on $s = 7.81$): $z_{obs} = \frac{2.39 - 0}{7.81/\sqrt{70}} = \frac{2.39}{0.93} = 2.57$
P-value (Based on $\sigma = 8$): $2P(Z \ge |2.49|) = 2(1 - .9936) = .0128$
P-value (Based on $s = 7.81$): $2P(Z \ge |2.57|) = 2(1 - .9949) = .0102$

Equivalence of 2-sided Tests and CI’s
• For $\alpha = 1 - C$, a 2-sided test conducted at significance level $\alpha$ will give equivalent results to a C-level confidence interval:
– If entire interval > $\mu_0$, P-value < $\alpha$, zobs > 0 (conclude $\mu > \mu_0$)
– If entire interval < $\mu_0$, P-value < $\alpha$, zobs < 0 (conclude $\mu < \mu_0$)
– If interval contains $\mu_0$, P-value > $\alpha$ (don’t conclude $\mu \ne \mu_0$)
• The confidence interval is the set of parameter values for which we would fail to reject the null hypothesis (based on a 2-sided test)

Decision Rules and Critical Values
• Once a significance level ($\alpha$) has been chosen, a decision rule can be stated, based on a critical value:
• 2-sided tests: H0: $\mu = \mu_0$, Ha: $\mu \ne \mu_0$
– If test statistic (zobs) > $z_{\alpha/2}$: Reject H0 and conclude $\mu > \mu_0$
– If test statistic (zobs) < $-z_{\alpha/2}$: Reject H0 and conclude $\mu < \mu_0$
– If $-z_{\alpha/2}$ < zobs < $z_{\alpha/2}$: Do not reject H0: $\mu = \mu_0$
• 1-sided tests (Upper Tail): H0: $\mu = \mu_0$, Ha: $\mu > \mu_0$
– If test statistic (zobs) > $z_\alpha$: Reject H0 and conclude $\mu > \mu_0$
– If zobs < $z_\alpha$: Do not reject H0: $\mu = \mu_0$
• 1-sided tests (Lower Tail): H0: $\mu = \mu_0$, Ha: $\mu < \mu_0$
– If test statistic (zobs) < $-z_\alpha$: Reject H0 and conclude $\mu < \mu_0$
– If zobs > $-z_\alpha$: Do not reject H0: $\mu = \mu_0$

Potential for Abuse of Tests
• Should choose a significance level ($\alpha$) in advance and report the test conclusion (significant/nonsignificant) as well as the P-value. A significance level of 0.05 is widely used in the academic literature
• Very large sample sizes can detect very small differences for a parameter value. A clinically meaningful effect should be determined, and a confidence interval reported when possible
• A nonsignificant test result does not imply no effect (that H0 is true).
• Many studies test many variables simultaneously.
This can increase overall type I error rates

Large-Sample Test H0: $\mu_1 - \mu_2 = 0$ vs HA: $\mu_1 - \mu_2 > 0$
• H0: $\mu_1 - \mu_2 = 0$ (No difference in population means)
• HA: $\mu_1 - \mu_2 > 0$ (Population Mean 1 > Pop Mean 2)

T.S.: $z_{obs} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
R.R.: $z_{obs} \ge z_\alpha$
P-value: $P(Z \ge z_{obs})$

• Conclusion - Reject H0 if the test statistic falls in the rejection region, or equivalently the P-value is $\le \alpha$

Example - Botox for Cervical Dystonia
• Patients - Individuals suffering from cervical dystonia
• Response - Tsui score of severity of cervical dystonia (higher scores are more severe) at week 8 of Tx
• Research (alternative) hypothesis - Botox A decreases mean Tsui score more than placebo
• Groups - Placebo (Group 1) and Botox A (Group 2)
• Experimental (Sample) Results:

$\bar{x}_1 = 10.1 \quad s_1 = 3.6 \quad n_1 = 33$
$\bar{x}_2 = 7.7 \quad s_2 = 3.4 \quad n_2 = 35$

Source: Wissel, et al (2001)

Example - Botox for Cervical Dystonia
Test whether Botox A produces lower mean Tsui scores than placebo ($\alpha = 0.05$)

H0: $\mu_1 - \mu_2 = 0$
HA: $\mu_1 - \mu_2 > 0$
T.S.: $z_{obs} = \frac{10.1 - 7.7}{\sqrt{\frac{(3.6)^2}{33} + \frac{(3.4)^2}{35}}} = \frac{2.4}{0.85} = 2.82$
R.R.: $z_{obs} \ge z_\alpha = z_{.05} = 1.645$
P-val: $P(Z \ge 2.82) = .0024$

Conclusion: Botox A produces lower mean Tsui scores than placebo (since 2.82 > 1.645 and P-value < 0.05)

2-Sided Tests
• Many studies don’t assume a direction wrt the difference $\mu_1 - \mu_2$
• H0: $\mu_1 - \mu_2 = 0$, HA: $\mu_1 - \mu_2 \ne 0$
• Test statistic is the same as before
• Decision Rule:
– Conclude $\mu_1 - \mu_2 > 0$ if $z_{obs} \ge z_{\alpha/2}$ ($\alpha = 0.05 \Rightarrow z_{\alpha/2} = 1.96$)
– Conclude $\mu_1 - \mu_2 < 0$ if $z_{obs} \le -z_{\alpha/2}$ ($\alpha = 0.05 \Rightarrow -z_{\alpha/2} = -1.96$)
– Do not reject $\mu_1 - \mu_2 = 0$ if $-z_{\alpha/2} \le z_{obs} \le z_{\alpha/2}$
• P-value: $2P(Z \ge |z_{obs}|)$

Power of a Test
• Power - Probability a test rejects H0 (depends on $\mu_1 - \mu_2$)
– H0 True: Power = P(Type I error) = $\alpha$
– H0 False: Power = 1-P(Type II error) = $1 - \beta$
• Example:
• H0: $\mu_1 - \mu_2 = 0$, HA: $\mu_1 - \mu_2 > 0$, $\sigma_1^2 = \sigma_2^2 = 25$, $n_1 = n_2 = 25$
• Decision Rule: Reject H0 (at $\alpha = 0.05$ significance level) if:

$z_{obs} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{2}} \ge 1.645 \;\Rightarrow\; \bar{x}_1 - \bar{x}_2 \ge 2.326$

Power of a Test
• Now suppose in reality that $\mu_1 - \mu_2 = 3.0$ (HA is true)
• Power now refers to the probability we (correctly) reject the null hypothesis. Note that the sampling distribution of the difference in sample means is approximately normal, with mean 3.0 and standard deviation (standard error) 1.414.
• Decision Rule (from last slide): Conclude population means differ if the sample mean for group 1 is at least 2.326 higher than the sample mean for group 2
• Power for this case can be computed as:

$P(\bar{X}_1 - \bar{X}_2 \ge 2.326) \qquad \bar{X}_1 - \bar{X}_2 \sim N(3, \sqrt{2.0} = 1.414)$

Power of a Test

$\text{Power} = P(\bar{X}_1 - \bar{X}_2 \ge 2.326) = P\left(Z \ge \frac{2.326 - 3}{1.41} = -0.48\right) = .6844$

• All else being equal:
• As sample sizes increase, power increases
• As population variances decrease, power increases
• As the true mean difference increases, power increases

Power of a Test
[Figure: Sampling distributions of $\bar{X}_1 - \bar{X}_2$ under H0 and under HA]

Power of a Test
Power Curves for group sample sizes of 25, 50, 75, 100 and varying true values of $\mu_1 - \mu_2$ with $\sigma_1 = \sigma_2 = 5$.
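The power calculation worked through on the preceding slides can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `power_upper_tail` and its arguments are hypothetical, it assumes the upper-tail z-test with known standard deviations used in the example, and it relies only on `statistics.NormalDist` from the standard library. For a true difference of 3.0 with both standard deviations equal to 5 and 25 subjects per group it gives a value close to the .6844 on the slide; the small gap comes from the slide rounding z to -0.48.

```python
from math import sqrt
from statistics import NormalDist


def power_upper_tail(true_diff, sigma1, sigma2, n1, n2, alpha=0.05):
    """Power of the upper-tail z-test of H0: mu1 - mu2 = 0.

    Hypothetical helper for illustration; assumes known population
    standard deviations and a normal sampling distribution.
    """
    z = NormalDist()  # standard normal
    # Standard error of the difference in sample means
    se = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    # Reject H0 when xbar1 - xbar2 is at least this large
    cutoff = z.inv_cdf(1 - alpha) * se
    # Probability of exceeding the cutoff when the true difference is true_diff
    return 1 - z.cdf((cutoff - true_diff) / se)


if __name__ == "__main__":
    # The worked example: sigma1 = sigma2 = 5, n1 = n2 = 25, true difference 3.0
    print(round(power_upper_tail(3.0, 5, 5, 25, 25), 4))
    # Power at the group sample sizes named in the power-curve caption
    for n in (25, 50, 75, 100):
        print(n, round(power_upper_tail(3.0, 5, 5, n, n), 4))
```

Evaluating the function over a grid of true differences for each sample size would trace out curves of the kind the caption describes.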
• For given $\mu_1 - \mu_2$, power increases with sample size
• For given sample size, power increases with $\mu_1 - \mu_2$

Sample Size Calculations for Fixed Power
• Goal - Choose sample sizes to have a favorable chance of detecting a clinically meaningful difference
• Step 1 - Define an important difference in means:
– Case 1: $\sigma$ approximated from prior experience or pilot study - difference can be stated in units of the data
– Case 2: $\sigma$ unknown - difference must be stated in units of standard deviations of the data

$\delta = \frac{\mu_1 - \mu_2}{\sigma}$

• Step 2 - Choose the desired power to detect the clinically meaningful difference ($1 - \beta$, typically at least .80). For 2-sided test:

$n_1 = n_2 = \frac{2(z_{\alpha/2} + z_\beta)^2}{\delta^2}$
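As a rough sketch of Step 2, the per-group sample size can be computed from the usual normal-approximation formula for a 2-sided test, $n_1 = n_2 = 2(z_{\alpha/2} + z_\beta)^2 / \delta^2$, where $\delta$ is the difference in means in units of standard deviations. The function name `n_per_group` and its defaults ($\alpha = 0.05$, power = .80) are assumptions made for this illustration, and the result is rounded up to a whole subject.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, alpha=0.05, power=0.80):
    """Per-group sample size for a 2-sided z-test (hypothetical helper).

    delta is the clinically meaningful difference in means, expressed in
    standard deviation units: (mu1 - mu2) / sigma.
    """
    z = NormalDist()  # standard normal
    z_alpha2 = z.inv_cdf(1 - alpha / 2)  # critical value for the 2-sided test
    z_beta = z.inv_cdf(power)            # z-value cutting off beta = 1 - power
    # Round up so the achieved power is at least the target
    return ceil(2 * (z_alpha2 + z_beta) ** 2 / delta ** 2)


if __name__ == "__main__":
    # Sample sizes for a few standardized differences
    for d in (0.25, 0.5, 1.0):
        print(d, n_per_group(d))
```

Because $\delta$ enters squared in the denominator, halving the difference to be detected roughly quadruples the required sample size per group.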