Presentation 8

First Part
Introduction to Inference: Confidence Intervals and Hypothesis Testing

What is inference?
Inference is when we use a sample to draw conclusions about a population.
1. Draw a representative SAMPLE from the POPULATION.
2. Describe the SAMPLE.
3. Use rules of probability and statistics to make conclusions about the POPULATION from the SAMPLE.

[Figure: example sample data table and a bar chart describing the sample]
Var 1   Var 2   Var 3
459     Brown   28
657     Red     43
321     Green   46
213     Blue    47
536     Blue    53

Population Parameters
p = population proportion
µ = population mean
σ = population standard deviation
β1 = population slope (we will see this in Ch. 14)

Sample Statistics
p̂ = sample proportion
x̄ = sample mean
s = sample standard deviation
b1 = sample slope (we will see this in Ch. 14)

Two Types of Inference
1. Confidence Intervals: (Ch. 10 & 12)
– Confidence intervals give us a range in which the population parameter is likely to fall.
– We use confidence intervals whenever the research question calls for an estimate of a population parameter.
Example: What is the mean age of trees in the forest? Estimate the proportion of US adults who would vote for candidate A.
2. Hypothesis Testing: (Ch. 11 & 13)
– Hypothesis tests are tests of population parameters.
Example: Is the proportion of US adult women who would vote for candidate A > 50%?
– We can only prove that a population parameter is 'different' from our null value. We cannot prove that a population parameter is equal to some value.
Valid Hypothesis: Is the mean age of trees in the forest > 50 years?
Invalid Hypothesis: Is the mean age of trees in the forest equal to 50 years?

Types of CIs and Hypothesis Tests
For Hypothesis Tests and C.I.'s:
– 1-proportion (1-categorical variable)
– 1-mean (1-quantitative variable)
– Difference in 2 proportions (2-categorical variables, both with 2 levels)
– Difference in 2 means (1-quantitative and 1-categorical variable, or 2-quantitative variables, independent samples)
– Regression, Slope (2-quantitative variables)
For Hypothesis Tests only:
– Chi-Square Test (2-categorical variables, at least one with 3 or more levels!)

Some Examples…
Mike wants to estimate the mean high-school GPA of incoming freshmen at Penn State.
Solution: CI for one population mean.
George wants to know if the proportion of students who engage in underage drinking is greater than 25%.
Solution: Test of one proportion. H0: p ≤ .25, Ha: p > .25
Doug wants to estimate the difference in the proportion of men and women who smoke.
Solution: CI for difference in 2 proportions.

Interpreting CI and Hypothesis Testing
Confidence Intervals: Given the confidence level β% (90%, 95%, 99%, etc.), we conclude with β% confidence that the population parameter is within the confidence interval.
Example: Suppose the 90% CI for the mean age of trees in the forest is (32, 45) years. Then we are 90% confident that the true mean age of trees in the forest is between 32 and 45 years.
Hypothesis Testing: Use the p-value to determine whether we can reject the null hypothesis. We do not need to know the exact definition now, or how to calculate the p-value, but generally the p-value is a measure of how consistent the data is with the null hypothesis. A small p-value (< .05) indicates the data we obtained was UNLIKELY under the null hypothesis.
Decision Rule:
– If the p-value is < .05 we REJECT the null hypothesis and accept the alternative. We have a statistically significant result!
– If the p-value is > .05 then we say that we do NOT have enough evidence to reject the null hypothesis.

Second Part
Confidence Intervals for 1-Proportion

Review of Ch. 9: Sample Proportion
Mean of p̂: $E(\hat{p}) = p$
Std. Dev. of p̂: $\mathrm{sd}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$
Standard Error of p̂: $\mathrm{se}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
If np and n(1 − p) are both greater than or equal to 10, the sampling distribution of p̂ is approximately normal with mean p and standard deviation $\sqrt{\frac{p(1-p)}{n}}$.

From Sampling Distributions to Confidence Intervals…
The sample proportion will fall close to the true proportion. Thus the true proportion is likely to be close to the observed sample proportion. How close?
95% of the p̂'s would be expected to fall within ± 2 standard deviations of the true proportion p.
So if we were to construct intervals around the p̂'s extending ± 2 standard deviations, these intervals would contain the TRUE population proportion 95% of the time!

Margin of Error & C.I.
p̂ is an estimator of p but it is not exactly equal to p. How far is p̂ from p?
– Margin of Error is a measure of accuracy providing a likely upper limit for the difference between p̂ and p.
– This difference is almost always less than the Margin of Error.
– "Almost always" translates to "with large probability". Usually we are talking about 90%, 95% or 99% probability.
– This probability is the confidence level. For example, if the confidence level is 95%, it means that 95% of the time the difference between p̂ and p is less than the Margin of Error (i.e. we expect 38 out of 40 samples to give a p̂ such that its difference from p is less than the Margin of Error).
Example: Based on a sample of 1000 voters, the proportion of voters who favor candidate A is 34% with a 3% Margin of Error based on a 95% confidence level. What does this tell us?

95% C.I. for 1-proportion (Derivation)
If np and n(1 − p) are both ≥ 10, the sampling distribution of p̂ is approximately normal with mean p and standard deviation sd(p̂).
From the empirical rule we have that for about 95% of the samples, p̂ is going to fall within 2 sd(p̂) of p, i.e. with 95% probability we have
$p - 2\,\mathrm{sd}(\hat{p}) \le \hat{p} \le p + 2\,\mathrm{sd}(\hat{p})$
$\Leftrightarrow\; -2\,\mathrm{sd}(\hat{p}) \le \hat{p} - p \le 2\,\mathrm{sd}(\hat{p})$
$\Leftrightarrow\; \hat{p} - 2\sqrt{\frac{p(1-p)}{n}} \le p \le \hat{p} + 2\sqrt{\frac{p(1-p)}{n}}$
There is a problem here! Since p is the unknown parameter of interest, sd(p̂) is also unknown. Thus, we substitute se(p̂) for sd(p̂). Doing so, we have that if np̂ and n(1 − p̂) are both ≥ 10, then with 95% probability we have
$-2\,\mathrm{se}(\hat{p}) \le \hat{p} - p \le 2\,\mathrm{se}(\hat{p})$
$\Leftrightarrow\; \hat{p} - 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le p \le \hat{p} + 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

95% Margin of Error and C.I. for p
Thus, if np̂ and n(1 − p̂) are both ≥ 10, the 95% Margin of Error is
$2\,\mathrm{se}(\hat{p}) = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
and the 95% C.I. for p is
Sample Statistic ± Margin of Error: $\hat{p} \pm 2\,\mathrm{se}(\hat{p}) = \hat{p} \pm 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
Note that we are using p̂ instead of p for the condition!

Example 1: Obtaining a 95% C.I. for p.
A sample of 1200 people is polled to determine the percentage that are in favor of candidate A. Suppose 580 say they are in favor. Construct a 95% CI for the true population proportion.
$\hat{p} = 580/1200 = .483$
$\mathrm{se}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{.483(1-.483)}{1200}} = .0144$
So the 95% CI for p is: $\hat{p} \pm 2\,\mathrm{se}(\hat{p}) = .483 \pm 2(.0144) = (.454, .512)$
Conclusion: We are 95% confident that the true population proportion of those who support candidate A is between 45.4% and 51.2%.

Any C.I. for 1-proportion
Conditions: We need to have np̂ ≥ 10 and n(1 − p̂) ≥ 10.
β% CI for p: $\hat{p} \pm z^{*}\,\mathrm{se}(\hat{p})$
Margin of Error = z* times the std. error
– The z* multiplier depends on the desired confidence level, β%.
– z* is such that P(−z* < Z < z*) = β%. The most common multipliers are:

Conf. level, β%   Multiplier, z*
90                1.64
95                1.96 ≈ 2
98                2.33
99                2.58

Interpretation: We are β% confident that the true population proportion, p, is contained within the confidence interval.
Another interpretation is that for about β% of samples from the population, the CI captures p.

Example 2: Obtaining a 99% C.I. for p.
300 high-risk patients received an experimental AIDS vaccine. The patients were followed for a period of 5 years and ultimately 53 came down with the virus.
Assuming all patients were exposed to the virus, construct a 99% CI for the proportion of individuals protected.
We have that the 99% CI for p is $\hat{p} \pm z^{*}\,\mathrm{se}(\hat{p})$, where z* = 2.58. (Can you see why using the Normal table?)
$\hat{p} = 247/300 = .823$
$\mathrm{se}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{.823(1-.823)}{300}} = .0220$
So the 99% CI for p is .823 ± 2.58(.0220) = (.767, .880).
We are 99% confident that the true proportion of those protected by the vaccine is between 76.7% and 88.0%.

The Width of a Confidence Interval is affected by:
– n: as the sample size increases, the standard error of p̂ decreases and the confidence interval gets narrower. So a larger sample size gives us a more precise estimate of p.
– z*: as the confidence level (β%) increases, the multiplier z* increases, leading to a wider CI.
So, if we want to control the length of the C.I. we can either adjust the confidence level or the sample size...

Question: What is an appropriate sample size to obtain a 95% C.I. that is not very wide (i.e. with a small Margin of Error)?
The Margin of Error for a 95% CI is equal to 2 × se(p̂). Before collecting the sample, p̂ is unknown, thus we cannot calculate the exact Margin of Error.
A conservative Margin of Error is equal to $\frac{1}{\sqrt{n}}$, noting that
$2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le 2 \cdot \frac{1/2}{\sqrt{n}} = \frac{1}{\sqrt{n}}$, since $\hat{p}(1-\hat{p}) \le \frac{1}{4}$.
This implies that p̂ differs from p by at most ___________ .
Using the conservative Margin of Error, the length of the C.I. is equal to _____________.
How large should n be to get a 95% CI of some length L? n=___________.
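
The calculations in this part can be checked with a short Python sketch. It is only an illustration using the standard library: the function names `proportion_ci` and `conservative_sample_size` are made up for this example and are not part of the slides. The first function applies p̂ ± z* se(p̂) together with the np̂ ≥ 10 and n(1 − p̂) ≥ 10 condition; the second assumes the conservative 95% Margin of Error 1/√n, so that an interval of total length L needs 2/√n ≤ L.

```python
import math


def proportion_ci(successes, n, z_star=2.0):
    """Return (p_hat, se, lower, upper) for the interval p_hat ± z* · se(p_hat).

    Illustrative helper (name is hypothetical). z_star defaults to 2,
    the rounded 95% multiplier used in the slides.
    """
    p_hat = successes / n
    # Condition from the slides: n·p_hat and n·(1 − p_hat) must both be >= 10.
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("need n*p_hat >= 10 and n*(1 - p_hat) >= 10")
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
    margin = z_star * se                     # margin of error
    return p_hat, se, p_hat - margin, p_hat + margin


def conservative_sample_size(length):
    """Smallest n whose conservative 95% CI (p_hat ± 1/sqrt(n)) has total length <= length.

    Assumes the conservative margin 1/sqrt(n): 2/sqrt(n) <= L gives n >= 4/L^2.
    """
    return math.ceil(4 / length ** 2)


# Example 1: 580 of 1200 in favor, 95% CI with multiplier 2.
ex1 = proportion_ci(580, 1200, z_star=2.0)

# Example 2: 247 of 300 protected, 99% CI with multiplier 2.58.
ex2 = proportion_ci(247, 300, z_star=2.58)

for label, (p_hat, se, lower, upper) in (("Example 1", ex1), ("Example 2", ex2)):
    print(f"{label}: p_hat={p_hat:.3f}, se={se:.4f}, CI=({lower:.3f}, {upper:.3f})")

# Sample size for a conservative 95% CI of total length 0.06 (margin of about 3%).
print("n for length 0.06:", conservative_sample_size(0.06))
```

Passing `z_star=1.96` instead of 2 gives a slightly narrower 95% interval; the other multipliers in the table (1.64, 2.33, 2.58) can be used the same way.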