Chapter 8: Hypothesis Testing for Population Proportions
The Basics of Significance Testing

Statistical Inference
• We have already discussed confidence intervals for an unknown population parameter, p
• Confidence intervals are used when the goal is to estimate an unknown population parameter like p (as when we estimated the true proportion of all 5,000 COC students who have at least one tattoo)
• This chapter... statistical inference through significance tests
• Evaluate evidence (a statistic) provided by sample data about some claim concerning an unknown population parameter like p

The Main Ingredients of Hypothesis Testing
• There once were four students who missed the midterm for their statistics class. They went to the professor together and said, “Please let us make up the exam. We carpool together, and on our way to the exam, we got a flat tire. That’s why we missed the exam.” The professor didn’t believe them, but instead of arguing he said, “Sure, you can make up the exam. Be in my office tomorrow at 8.”
• The next day, they met in his office. He sent each student to a separate room and gave them an exam. The exam consisted of only one question: “Which tire?”
• Let’s imagine all four students answered, “left rear tire.”
• So... what do you think? Were the students most likely telling the truth? Lying?
• Surprised or not?
• The professor suspected they had been lying. That’s why he did what he did.
• Maybe they just got lucky ... just by chance they all guessed the same tire.
• The probability that all four students would guess the same tire is only about 1.6% (4 × (1/4)^4 = 1/64, if each student picks one of the four tires at random).
• Do you consider it likely/typical or unlikely/rare that they could have simply, by chance, guessed the same tire?

I’m a great free-throw shooter...
• I claim that in the last 5 years of playing basketball, I, on average, make 95% of my basketball free throws.
• To test my claim, I am asked to shoot 50 free throws.
I make only 10 of the 50 (only 20%).
• Do you still believe my claim that I make 95% of my free throws? Why or why not?

I claim 95% ... I actually made only 20%...
• Do you agree that statistics vary from sample to sample? So if I attempted another 50 free throws, chances are I would make something other than 10 of them?
• So the question is... would making only 10 out of 50 (or 20%) happen so very rarely (assuming my 95% claim were true) that you now start to doubt whether my claim really is true?

Hypothesis Testing or Significance Testing ...
• A formal procedure that enables us to choose between two hypotheses when we are uncertain about our measurements.
• Basic idea... An outcome that would rarely happen if a claim were really true is good evidence that the claim is not true.
• Example... I claim that 99% of adult humans are 6 feet tall or taller.
• So, I’m going to take an SRS of 1000 humans and measure their heights. Then I calculate the sample mean height and get 5’ 8”. Do you agree or disagree with my claim, based on my sample statistic?
• If my claim were true, it would be very rare for most of the adult humans in an SRS of 1000 to be shorter than 6 feet. OR, on the flip side, it would be very common for the humans in an SRS of 1000 to be 6 feet tall or taller.

Start with: A Pair of Hypotheses
• If we flip a penny, we can agree that the probability of heads or tails is 0.50; fair.
• However, some claim that if we spin a penny on a table, because the heads side bulges outwards, the lack of symmetry will cause the spinning coin to land on one side more often than the other; the probability is not 0.50 for each side; unfair.
• Some people might find this claim outrageous; completely false.

Start with a research hypothesis...
Null hypothesis, Ho: p = 0.50
- The null hypothesis is always neutral, no change, always =
- The null hypothesis is always in terms of a population parameter (like p or μ)
Alternative hypothesis, Ha: p ≠ 0.50
- The alternative hypothesis is always <, >, or ≠
- The alternative hypothesis is always in terms of a population parameter (like p or μ)

Like in a criminal trial...
Null hypothesis, Ho: p = 0.50
- In the beginning, we assume the null is true (like a defendant is assumed not guilty at the beginning of a trial) until there is overwhelming evidence that suggests this is not so; then we may reject this belief if/when the evidence is clearly against it
Alternative hypothesis, Ha: p ≠ 0.50

Null Hypothesis... Ho ...
• The null hypothesis always gets the benefit of the doubt and is assumed to be true throughout the hypothesis-testing procedure. If we decide at the last step that the observed outcome is extremely unusual under this assumption, then and only then do we reject the null hypothesis.

Ho: p = 0.50 Ha: p ≠ 0.50
• If the null hypothesis is correct, then when we spin a coin a number of times, about ½ of the outcomes should be heads. If the null hypothesis is wrong, we will see either a much larger or much smaller proportion.
• Let’s spin some pennies. Spin (on desk) 20 times. Count the # of heads
• Calculate the sample proportion of heads, p̂, and write it on the board

Ho: p = 0.50 Ha: p ≠ 0.50
• Let’s look at our sampling distribution; describe using SOCS (review)
• If we did this again, would we get different results?
• How ‘extreme’ a result would we need for you to not believe our null hypothesis/to reject the null?
• We will come back to this later in the chapter...

Practice with null and alternative hypotheses...
What’s wrong with ...
• Ho: p = 0.17 Ha: p ≠ 0.19
• Ho: p = -0.20 Ha: p < 0.15
• Ho: p > 0.45 Ha: p = 0.45
• Ho: p = 1.50 Ha: p > 1.50
• Ho: p̂ = 0.92 Ha: p̂ < 0.92

Practice: State the appropriate null hypothesis and alternative hypothesis in each case.
Be sure to define your parameter each time.
• A recent Gallup Poll report on a national survey of 1028 teenagers revealed that 72% of teens said they rarely or never argue with their friends. You wonder whether this national result would be different in your school. So you conduct your own survey of a random sample of students at your school.
• The proportion of people who live after suffering a stroke is 0.85. A drug manufacturer has just developed a new treatment that they claim will increase the survival rate.

Explain what is wrong in each situation and why it is wrong
• A change is made that should improve student satisfaction with the parking situation at COC. The null hypothesis, that there is an improvement, is tested versus the alternative, that there is no change.
• A researcher tests the following null hypothesis: Ho: p̂ = 0.80
• A statistics instructor at COC read that 90% of all college students use social media on a regular basis. She wonders if the percent of COC students who use social media on a regular basis is different. Ho: p = 0.90 Ha: p > 0.91
• The Census Bureau reports that households spend an average of 31% of their total spending on housing. A homebuilders association in Cleveland believes that this average is lower in their area. They interview a sample of 40 households in the Cleveland metropolitan area to learn what percent of their spending goes toward housing. Take p to be the mean percent of spending devoted to housing among all Cleveland households.
Ho: p = 31% Ha: p < 31%

The Main Ingredient: Surprise
• Surprise itself; when something unexpected occurs (like only making 20% of free throws when we claimed to make 95%)
• The null hypothesis tells us what to expect; it’s what we believe throughout the process until we see evidence otherwise
• If we see something unexpected, then we should doubt the null hypothesis
• If we are really surprised, then we should reject it altogether

Let’s go back to our penny spinning... We have a way to measure our “surprise”...
• Instead of just not surprising, kind of surprising, very surprising, etc., we have...
• p-value
• A p-value is a probability. Assuming the null hypothesis is true, the p-value is the probability that, if the experiment were repeated many times, we would get an outcome as extreme as or more extreme than the one we actually got (our statistic). A small p-value suggests that a surprising outcome has occurred and discredits the null hypothesis.

P-Values
• A p-value is a quantitative measure of how rare/unlikely a finding is
• Small p-values are evidence against Ho
• Large p-values fail to give evidence against Ho

P-value... all about extremes...
• Understanding how to interpret a p-value is crucial to understanding hypothesis testing.
• Minitab will calculate the p-value, but we need to understand how the computer did the calculation
• The meaning of the phrase “as extreme as or more extreme than” depends on the alternative hypothesis

Three basic pairs of hypotheses...

Let’s go back to spinning coins...
Ho: p = 0.50 Ha: p ≠ 0.50
• Note: the closer the number of heads is to 10, the larger the p-value
• Also note the p-value for an outcome of 11 heads is the same as for 9 heads, etc.

Statistical Significance...
• Most of the time, we take one more step to assess evidence against Ho
• We compare the p-value to some predetermined value (rather than a vague ‘unlikely’) called a significance level, symbol α (alpha)
• Can think of this as a rejection zone (sketch)
• The significance level makes ‘not likely’ more exact, more informative
• Most common α levels are α = 0.05 or α = 0.01
• Interpretation: At α = 0.05, we require evidence against Ho so strong that it would happen no more than 5% of the time when Ho is true
• If the p-value is as small as or smaller than α, we say the data are statistically significant at level α
• Note: ‘significant’ in statistics doesn’t mean important (like in English); it means not likely to happen by chance

Let’s sketch some pictures of rejection zones and p-values...
• Ho: p = ... Ha: p ...
• I gathered sample data and calculated a p-value based on the sample data (the probability of getting that value or one more extreme, assuming that the null hypothesis is true)
• 1-sided
• 2-sided

Statistically Significant Sketches
• If the p-value is 0.03... this is significant at the α = 0.05 level (in rejection zone)
• If the p-value is 0.03... this is not significant at the α = 0.01 level (not in rejection zone)

Interpretation/Wording
Reject Ho (Null Hypothesis): This happens when the sample statistic is statistically significant; a result this extreme is too unlikely to have occurred by chance if Ho were true (we don’t believe the null hypothesis); in the rejection zone
Wording must reference all of the following for a complete interpretation... p-value, α level, reject Ho, and conclusion in context (caution about using the word ‘cause’ or ‘prove’).

Interpretation/Wording
Fail to Reject Ho (Null Hypothesis): This happens when the sample statistic could plausibly have occurred by chance (we do not have enough evidence against the null hypothesis, so we keep it; this does not prove that Ho is true); not in rejection zone
Wording must reference all of the following for a complete interpretation...
p-value, α level, fail to reject Ho, and conclusion in context (caution about using the word ‘cause’ or ‘prove’)

Conditions for Tests about a population proportion...
• Random Sample ... randomly selected or randomly assigned
• Large Sample Size; Normality (see next slide) ... np0 ≥ 10 and n(1 - p0) ≥ 10; the sample has at least 10 expected successes and at least 10 expected failures
• Big Population (Independence) ... Population at least 10 times sample size; and each observation has no influence on any other
• ...if these conditions are satisfied, then we can use the Central Limit Theorem for sample proportions; the distribution is ≈ Normal! That’s a great thing!
• When doing a hypothesis test, you MUST check conditions... this is an essential part of the hypothesis testing process

So many p’s... Caution!

Hypothesis testing in four steps...

Work stress...
According to the National Institute for Occupational Safety and Health, job stress poses a major threat to the health of workers. A national survey of restaurant employees found that 75% said that work stress had a negative impact on their personal lives. A simple random sample of 100 employees from a large restaurant chain finds that 68 answer “Yes” when asked, “Does work stress have a negative impact on your personal life?” Is this good reason to think that the proportion of all employees in this chain who would say “Yes” differs from the national proportion p0 = 0.75?
Ho: p = 0.75 Ha: p ≠ 0.75
We want to test a claim about p, the true proportion of this chain's employees who would say that work stress has a negative impact on their personal lives.

Work stress...
Conditions: 1-sample proportion; α = 5% (rejection zone)
• Random Sample: stated in problem
• Large Sample Size/Normality: The expected numbers of “Yes” and “No” responses are (100)(0.75) = 75 and (100)(0.25) = 25, respectively. Both are at least 10.
• Big Population (Independence): Since we are sampling without replacement, this “large chain” must have at least (10)(100) = 1000 employees.

Ho: p = 0.75 Ha: p ≠ 0.75
Sample statistic: p̂ = 68/100 = 0.68
Calculations for 1-sample proportion, 2-sided hypothesis test; use Minitab (1 sample, proportion; change options and data as needed)
z = -1.6165 P-value = 0.1059

Work stress...
Interpretation: Fail to reject Ho. With a P-value of 0.1059 and α = 5%, we fail to reject the null hypothesis and conclude that there is not enough evidence to suggest that the proportion of this restaurant chain's employees who say work stress has a negative impact on their personal lives is different from the national survey result, 0.75.

We want to be rich...
• In a recent study, 73% of first-year college students responding to a national survey identified “being very well-off financially” as an important personal goal. A state university finds that 132 of a random sample of 200 of its first-year students say that this goal is important.
• Is there evidence that the proportion of all first-year students at this university who think being very well-off is important differs from the national value, 73%? Carry out a significance test to help answer this question.

n = 200; x = 132; SRS; p0 = 0.73; p̂ = 132/200 = 0.66
We want to test Ho: p = 0.73 versus Ha: p ≠ 0.73, where p is the proportion of all first-year students at this university who think being very well-off is important.
Conditions: 1-sample proportion; α = 5% (rejection zone)
• Random Sample/SRS: stated in problem
• Large Sample Size/Normality: np0 ≥ 10 and n(1 - p0) ≥ 10; (200)(0.73) = 146 ≥ 10 and (200)(1 - 0.73) = 54 ≥ 10
• Big Population (Independence): We must assume at least (10)(200) = 2000 first-year students in the population.
Calculations... Minitab; 1-sample proportion
z ≈ -2.23 P-value = 0.0258
Interpretation... Reject Ho.
With a p-value of 0.0258, and assuming α = 0.05, we conclude that we do have statistically significant evidence that the proportion of all first-year students at this university who think being very well-off is important differs from the national value. (determination, p-value, α, and context... always)

What if... our alpha had been 1%? Would our decision have changed?

Dreaming in color...
Researchers wondered whether a greater proportion of people now dream in color than did so before color television and movies became as prominent as they are today. In the past, before color TV and movies, this proportion was 0.29. Researchers took a random sample of 113 people. Of these 113 people, 92 reported dreaming in color. Is there evidence (at a significance level of 1%) that more people today dream in color than in the past (before color TV and movies became as prominent as they are today)? Carry out an appropriate hypothesis test to help answer this question.

Dreaming in color...
What are our null and alternative hypotheses?
Ho: p = 0.29 Ha: p > 0.29
Conditions: 1-sample proportion; α = 1% (rejection zone)
• Random Sample
• Large Sample Size/Normality
• Big Population/Independence
Calculations: z = 12.28, P-value ≈ 0
Determination and interpretation: Reject the null hypothesis. At an alpha level of 1%, and with a P-value of about zero, there is sufficient evidence to suggest that more people today dream in color than in the past (before color TVs, etc.)
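
The z and P-values in the three one-sample examples above came from Minitab. As a rough cross-check, here is a minimal Python sketch of the same 1-sample proportion z-test, assuming the Normal approximation with standard error sqrt(p0(1 - p0)/n). The names `one_prop_ztest`, `normal_cdf`, and the `alternative` argument are made up for this illustration; they are not Minitab commands or part of any library.

```python
from math import sqrt, erfc


def normal_cdf(z):
    """Area under the standard Normal curve to the left of z."""
    return 0.5 * erfc(-z / sqrt(2))


def one_prop_ztest(x, n, p0, alternative="two-sided"):
    """Return (z, p_value) for Ho: p = p0 using the Normal approximation.

    alternative is "two-sided" (Ha: p != p0), "less" (Ha: p < p0),
    or "greater" (Ha: p > p0). Illustrative helper, not a library call.
    """
    p_hat = x / n                      # sample proportion
    se = sqrt(p0 * (1 - p0) / n)       # standard error assuming Ho is true
    z = (p_hat - p0) / se
    if alternative == "less":
        p_value = normal_cdf(z)        # left tail
    elif alternative == "greater":
        p_value = normal_cdf(-z)       # right tail
    else:
        p_value = 2 * normal_cdf(-abs(z))  # both tails
    return z, p_value


# Data taken from the three examples in the slides.
examples = [
    ("Work stress", 68, 100, 0.75, "two-sided"),
    ("We want to be rich", 132, 200, 0.73, "two-sided"),
    ("Dreaming in color", 92, 113, 0.29, "greater"),
]
for name, x, n, p0, alt in examples:
    # Large Sample Size/Normality condition from the slides
    large_enough = n * p0 >= 10 and n * (1 - p0) >= 10
    z, p_value = one_prop_ztest(x, n, p0, alt)
    print(f"{name}: z = {z:.4f}, P-value = {p_value:.4f}, "
          f"normality condition met: {large_enough}")
```

The printed values should agree with the Minitab results quoted above up to rounding. Comparing each P-value with α then gives the reject / fail to reject decision.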
Two Proportion Hypothesis Testing
• Ho: p1 = p2
• Ha: p1 ≠ p2, p1 > p2, or p1 < p2
Minitab will calculate this for us; no need to memorize

Two Proportion Hypothesis Testing Conditions...
• SRS: Each of the two samples must be an SRS from its respective population, or the data must come from a randomized experiment
• Normality: Each of the following is ≥ 10: (n1)(p̂c), (n1)(1 - p̂c), (n2)(p̂c), (n2)(1 - p̂c), where p̂c is the combined (pooled) sample proportion
• Big Population/Independence: Each of the populations must be at least 10 times the corresponding sample size; and one sample does not influence the other

Two Proportions... Conditions...
• On our tests, I will not ask you to check conditions for 2-sample proportion hypothesis testing
• We will assume all conditions have been checked and met for 2-sample proportion hypothesis testing
• For your project, however, you will have to check conditions

Does Pre-School Help?
To study the long-term effects of preschool programs for poor children, a research foundation has followed two groups of Michigan children since early childhood. A control group of 61 children represents population 1, poor children with no pre-school. Another group of 62 from the same area and similar backgrounds, who attended pre-school as 3- and 4-year-olds, represents population 2, poor children who attend pre-school. Sizes are n1 = 61 and n2 = 62. One response variable of interest is the need for social services as adults. In the past ten years, 38 of the preschool sample and 49 of the control sample have needed social services (mainly welfare). Carry out a hypothesis test to determine if there is significant evidence that pre-school reduces or increases the later need for social services. You may assume all conditions have been checked and met. Use an alpha level = 5%.
n pre-school = 62; n no pre-school = 61
38 of pre-school needed social services; 49 of no pre-school needed social services
State null and alternative hypotheses
Ho: p no pre-school = p pre-school Ha: p no pre-school ≠ p pre-school
Conditions: Stated in the problem that they have been checked and met. We are doing a 2-proportion hypothesis test.
Minitab to calculate test statistic, p-value, etc. (Two Sample, Proportion, Options & Data)
z = -2.3201 (computed as pre-school minus no pre-school) P-value = 0.0203
Interpretation: Reject null hypothesis. At a significance level of 5% (α = 0.05), and with a p-value of approximately 0.02, there is sufficient evidence to show that p no pre-school ≠ p pre-school (that is, evidence that pre-school changes, either reduces or increases, the later need for social services).

Fear of Crime...
The elderly fear crime more than younger people, even though they are less likely to be victims of crime. One of the few studies that looked at older blacks recruited random samples of 56 black women and 63 black men over the age of 65 from Atlantic City, New Jersey. Of the women, 27 said they “felt vulnerable” to crime; 46 of the men said this. What proportion of women in the sample feel vulnerable? Of men? (Note: Men are victims of crime more often than women, so we expect a higher proportion of men to feel vulnerable.)

Fear of Crime...
Test the hypothesis that the true, unknown population proportion of all elderly black men who feel vulnerable is higher than that of all elderly black women who feel vulnerable. You may assume that all conditions have been checked and met.
Hypothesis, Conditions/Name of Procedure/Alpha Level, Computations, Interpretation
Ho: p men = p women or p men - p women = 0
Ha: p men > p women or p men - p women > 0
• sample statistics: 46/63 men & 27/56 women
• z = 2.7731 P-value = 0.0028
• Reject null hypothesis.
At any reasonable alpha level, with a P-value less than 1%, we have evidence to suggest that the proportion of all elderly black men who feel vulnerable is higher than the proportion of all elderly black women who feel vulnerable.

Cholesterol & Heart Attacks...
• High levels of cholesterol in the blood are associated with higher risk of heart attacks. Will using a drug to lower blood cholesterol reduce heart attacks? The Helsinki Heart Study looked at this question. Middle-aged men were assigned at random to one of two treatments: 2,051 men took the drug gemfibrozil to reduce their cholesterol levels, and a control group of 2,030 men took a placebo. During the next five years, 56 men in the gemfibrozil group and 84 men in the placebo group had heart attacks.
• Is the apparent benefit of gemfibrozil statistically significant? Assume all conditions have been checked and met. Use a 1% alpha level.
Ho: p gemfibrozil = p placebo Ha: p gemfibrozil < p placebo
OR
Ho: p gemfibrozil - p placebo = 0 Ha: p gemfibrozil - p placebo < 0
We want to use this comparative randomized experiment to draw conclusions about p1, the proportion of middle-aged men who would suffer heart attacks after taking gemfibrozil, and p2, the proportion of middle-aged men who would suffer heart attacks if they only took a placebo. We hope to show that gemfibrozil reduces heart attacks, so we have a one-sided alternative.
n gemfibrozil = 2,051 n placebo = 2,030
x gemfibrozil = 56 x placebo = 84
Sample statistic for gemfibrozil = 56/2051 ≈ 2.7% had heart attacks
Sample statistic for placebo = 84/2030 ≈ 4.1% had heart attacks
Is this difference just due to chance? Or is there really a difference between the medication and the placebo?
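
One way to see what Minitab does with a question like this is a pooled two-proportion z-test. The sketch below is a minimal Python version, assuming the usual pooled standard error sqrt(p̂c(1 - p̂c)(1/n1 + 1/n2)) with p̂c = (x1 + x2)/(n1 + n2). The names `two_prop_ztest`, `normal_cdf`, and `alternative` are invented for this illustration and are not Minitab commands.

```python
from math import sqrt, erfc


def normal_cdf(z):
    """Area under the standard Normal curve to the left of z."""
    return 0.5 * erfc(-z / sqrt(2))


def two_prop_ztest(x1, n1, x2, n2, alternative="two-sided"):
    """Return (z, p_value) for Ho: p1 = p2 with a pooled standard error.

    z is computed as (p_hat1 - p_hat2) / SE, so its sign depends on
    which group is listed first. Illustrative helper, not a library call.
    """
    p_hat1, p_hat2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)     # combined sample proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p_hat1 - p_hat2) / se
    if alternative == "less":          # Ha: p1 < p2
        p_value = normal_cdf(z)
    elif alternative == "greater":     # Ha: p1 > p2
        p_value = normal_cdf(-z)
    else:                              # Ha: p1 != p2
        p_value = 2 * normal_cdf(-abs(z))
    return z, p_value


# Helsinki Heart Study data from the slide: group 1 = gemfibrozil,
# group 2 = placebo, one-sided alternative p1 < p2, alpha = 1%.
alpha = 0.01
z, p_value = two_prop_ztest(56, 2051, 84, 2030, alternative="less")
decision = "reject Ho" if p_value <= alpha else "fail to reject Ho"
print(f"z = {z:.2f}, P-value = {p_value:.4f}, decision: {decision}")
```

The same function can be pointed at the pre-school data (two-sided) or the fear-of-crime data (one-sided, "greater") by changing the counts and the `alternative` argument.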
z = -2.47 P-value = 0.0068
Reject null hypothesis. With an alpha level of 1% and a P-value of 0.0068, we have evidence that there is a benefit to gemfibrozil (i.e., the drug appears to reduce heart attacks).

Three-Strikes Law...
California’s controversial ‘three strikes law’ requires judges to sentence anyone convicted of three felony offenses to life in prison. Supporters say that this decreases crime; opponents argue that people serving life sentences have nothing to lose, so violence within the prison system increases. Assume all conditions have been checked and met.
• Researchers looked at data from the California Department of Corrections.
• Of 734 prisoners who had three strikes, 163 of them had committed ‘serious’ offenses
• Of 3,188 prisoners who did not have three strikes, 974 had committed ‘serious’ offenses
• Determine whether those with three strikes tend to have more ‘serious’ offenses than those who do not. Use a 5% significance level.
Ho: p 3 strikes offenses = p no 3 strikes offenses Ha: p 3 strikes offenses > p no 3 strikes offenses
Sample statistic for prisoners who had three strikes was 163/734 ≈ 22.2%
Sample statistic for prisoners who did not have three strikes was 974/3188 ≈ 30.6%
z = -4.49 P-value = 0.9999
Fail to reject null hypothesis. At a 5% alpha level and a P-value ≈ 1, there is not sufficient evidence to conclude that prisoners who have three strikes commit more serious offenses than those prisoners who do not have three strikes.

We check conditions for a reason...
• Again, you do not have to worry about checking conditions for 2-sample proportion tests... It’s basically a mess to check conditions for 2-sample tests...
• But know that...
in the real world, if conditions are not satisfied, our results may not be accurate, reliable, trustworthy, etc.

Use & Abuse of Tests...
• Significance tests are used in a variety of settings... marketing, FDA drug testing, discrimination court cases, etc.
• Significance tests quantify how unlikely an event is to occur simply by chance
• Different levels of significance (α) are chosen depending on the given situation; typically α = 0.10, 0.05, or 0.01
• Continue to use caution when using “prove” or “cause”... even when doing hypothesis testing
• P-values allow us to decide individually if evidence is sufficiently strong
• But there is still no practical distinction between p-values of, say, 0.049 and 0.051 if our alpha level was, say, 5%
• Statistical inference does not correct basic flaws in survey or experimental design, such as ...

Using Inference to Make Decisions...
Sometimes we do everything correctly... data collection, conditions, calculations, interpretation... but we still make an incorrect decision/determination... perhaps we just happen to get a sample statistic that is very extreme... that really doesn’t represent our population accurately ... we reject the null hypothesis when we really should have failed to reject (Ho was really true) OR we fail to reject the null hypothesis when we really should have rejected the null hypothesis (Ho was really false) ... we make an error

Making errors when using inference...
• Type I Error: We reject Ho (null hypothesis) when Ho is really true. In other words, we determine Ha (alternative hypothesis) is true when, in actuality, Ho (null hypothesis) is true
• Type II Error: We fail to reject Ho (null hypothesis) when Ho is really false. In other words, we stick with Ho (null hypothesis) when, in reality, Ha (alternative hypothesis) is true

Type I and Type II Errors...

Probabilities of Type I and Type II Errors...
• Probability of Type I Error (rejecting Ho when the null is really true): α, your significance level for the hypothesis test.
• Probability of Type II Error (failing to reject Ho when the alternative is really true): β. Very complicated to calculate. Beyond the scope of this course.

Power of a Test...
• Power: Probability that a test will reject Ho when Ha is true
• Think of power as making the correct decision, not making an error, not making a mistake
• A high level of power is a good thing
• Power = 1 - β (remember β is the probability of making a Type II error); so ‘power’ and β are complementary

Power of a Test...
• How can we increase power (making the correct decision)?
• Increase α
• Increase n
• Decrease standard deviation (same effect as increasing the sample size, n)

Chapter 8 HW Quiz ...
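
As an optional supplement to the Power of a Test bullets above (the slides treat calculating β as beyond the course), here is a rough Python sketch of how power responds to α and n for a one-sided, one-proportion z-test. Everything specific here is an assumption made for illustration: the function name `power_one_sided`, the null value p0 = 0.50 (borrowed from the penny-spinning example), and a hypothetical true proportion of 0.60. It uses the Normal approximation throughout.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided(p0, p_true, n, alpha):
    """Approximate power of the test Ho: p = p0 vs Ha: p > p0.

    p_true is a hypothetical true proportion (an assumption, since in
    practice it is unknown). Illustrative helper, not a library call.
    """
    std_normal = NormalDist()                   # mean 0, sd 1
    z_crit = std_normal.inv_cdf(1 - alpha)      # edge of the rejection zone
    # Smallest sample proportion that lands in the rejection zone
    cutoff = p0 + z_crit * sqrt(p0 * (1 - p0) / n)
    # Spread of p_hat when the true proportion is p_true
    se_true = sqrt(p_true * (1 - p_true) / n)
    # Power = chance p_hat lands beyond the cutoff when p = p_true
    return 1 - std_normal.cdf((cutoff - p_true) / se_true)


# Hypothetical scenario: Ho says p = 0.50 but the truth is p = 0.60.
for n, alpha in [(50, 0.01), (50, 0.05), (200, 0.05)]:
    power = power_one_sided(0.50, 0.60, n, alpha)
    print(f"n = {n:3d}, alpha = {alpha:.2f}: "
          f"power = {power:.3f}, beta = {1 - power:.3f}")
```

Under these assumptions, raising α from 0.01 to 0.05 at the same n raises the power, and raising n from 50 to 200 at the same α raises it further, which is the pattern the bullets describe.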