Document 10291597

Objectives 6.2, 7.1
Tests of significance (CIS Chapters 11 and 12)

• The purpose of significance tests
• Stating and assessing hypotheses
• The t-statistic and the P-value
• Statistical significance
• The one-sample t test for a population mean
• One-sided versus two-sided tests
• Further reading: http://onlinestatbook.com/2/logic_of_hypothesis_testing/logic_hypothesis.html and http://onlinestatbook.com/2/tests_of_means/single_mean.html

What is a hypothesis?

• Based on observations we want to answer the following question. Question: Can pigeons fly?
• We write this as a pair of hypotheses (conjectures):
  H0: Pigeons cannot fly (we call this the null hypothesis; it is the opposite of the conjecture).
  HA: Pigeons can fly (we call this the alternative hypothesis; it is the conjecture of interest).
• If we can disprove the null, that is, show that the null is unlikely, this in turn means the alternative seems likely.
• Scenario 1: You are watching pigeons and you see a pigeon fly. You have immediately disproven the null (that pigeons cannot fly) and thus proven the alternative.
• Scenario 2: You are watching pigeons and they are busy eating. None of them are flying (far too much food to even attempt it). All this is consistent with the null being true; however, it does not prove the null. In this situation we say there is no evidence to prove the alternative.

The role of the null and alternative hypothesis

• In most types of research/investigation we make two hypotheses, the null and the alternative.
• We can only prove the alternative; we cannot prove the null. This is because we can only prove the alternative by disproving the null, i.e. by discounting the possibility of the null being true. This is why the `research hypothesis' is always the alternative.
• In most situations we cannot switch the null and alternative about without taking great care to rewrite them correctly (this comes later).
Unlike the pigeon example given on the previous slide, in statistics we cannot prove for certain that the alternative is true. Instead:
• We start by writing down the null and alternative hypotheses of interest.
• We collect the data.
• We calculate how likely it is to obtain the data we observe if the null hypothesis were true. If this chance/probability is very small, then the null hypothesis is unlikely to be true, which suggests that the alternative hypothesis is true. We say `there is evidence to prove the alternative hypothesis' or, equivalently, `there is evidence to reject the null'.

Motivation 1 (red wine example)

Let us return to the red wine example from the confidence interval section. It has been suggested that drinking red wine in moderation may protect against heart attacks. This is because red wine contains polyphenols, which act on blood cholesterol. To write this in a statistical sense, we let µ denote the mean change in polyphenol levels if the entire population started to drink moderate amounts of red wine. If µ > 0, this means that on average polyphenol levels would rise when drinking red wine (thus proving the conjecture). We write this as:

H0: µ ≤ 0 (mean levels of polyphenols have stayed the same or reduced) vs HA: µ > 0 (mean levels of polyphenols have risen)

Now we collect data. To see if moderate red wine consumption increases the average blood level of polyphenols, a group of nine randomly selected healthy men were assigned to drink half a bottle of red wine daily for two weeks.

• SCENARIO 1: The difference in polyphenol levels before and after the study is: -0.60 -1.05 -2.09 -1.23 0.71 -0.53 0.33 -0.48 -1.42. The sample mean for these 9 men is x̄ = -0.7. Clearly for these men polyphenol levels have not risen on average. A sample mean of -0.7 is entirely consistent with the null being true (the mean level staying the same or reducing). Remember that the sample mean is estimating the true mean.
q  q  q  SCENARIO 1 (cont). Since x̄ = 0.7 is consistent with the null hypothesis µ≤0, there is no evidence to disprove the null. Therefore, there is no evidence in the data that the polyphenol levels increase with moderate consumption of red wine. SCENARIO 2: The difference in polyphenol levels before and after the study is: 0.06 -0.36 0.98 0.82 -0.25 2.49 -1.34 1.16 1.53. The sample mean for these 9 guys is x̄ = 0.56 . Now the sample mean and the population mean µ≤0 are in different regions. In other words, for this group of people a positive increase in polyphenol levels is seen. BUT does this disprove the null? Not without some calculations. It is still possible to get a positive sample mean, when µ≤0. We need to calculate how likely it is get a sample mean of x̄ = 0.56 when the population mean µ≤0. SCENARIO 3: The difference in polyphenol levels before and after the study is: 8.45 10.18 10.98 10.35 10.75 8.98 8.84 10.38 9.79. The sample mean is x̄ = 9.86 It is clear that polyphenol levels have risen for these randomly selected volunteers, and though now it is hard to articulate, it does not seem to by lucky chance. It really seems that this data is completely inconsistent with the null (population mean µ≤0) and strongly suggests that the alternative is true. SCENARIO 4: The difference in polyphenol levels before and after the study is: -0.43 -8.35 -8.31 26.11 4.32 25.02 9.40 11.54 0.71. The sample mean is x̄ = 6.66 . Does this data disprove the null? How likely to get this data under the null. A statistical test allows us to systematically navigate these different scenarios. q  Motivation 2: Does the lady take milk? p  Recall the tea story: One lady insisted that the tea tasted different depending on whether milk was poured into the cup and then the tea or if the tea was first poured and then the milk. Fisher suggests that this can be tested, by randomly giving her milk first and tea first cups and asking her to identify the cup. 
The competing hypotheses are:
• H0: The lady guesses which cup is which by random chance.
• HA: The lady is able to select the correct cup.
• We collect the data and find that she identifies all 8 cups of tea correctly. This is the data we observe.
• In order to prove the alternative we have to calculate how plausible it is to identify all the cups of tea correctly under the null that she was simply guessing. This is checking whether the data is consistent with the null being true. If this probability is small, then it suggests that the null is implausible (we have disproved the null). If the null is implausible, then this implies the alternative is plausible (there is evidence to suggest the alternative is true).

Motivation 2 (cont)

• The ONLY way to `prove' the alternative (she has the ability to correctly identify the cup) is to show that the null is implausible. If the null is in any way plausible, then we cannot reject the null (and prove the alternative).
• The probability of her identifying all cups correctly is 1/72. This means there is a 1/72 = 1.39% chance of her identifying the cups of tea by simply guessing.
• If the probability is over a threshold, then the null is deemed plausible and we cannot reject the null. If it is below the threshold, then the null is deemed implausible and we can reject the null.
• Typically, the α = 5% significance level is used as the threshold. Since 1/72 = 1.39% is LESS than 5%, at the 5% level we believe the null is implausible and thus reject it (saying that there is evidence to suggest the alternative, that she knows her tea, is true).
• However, we will never know the truth! There is a 1.39% chance she got the result by a lucky guess. By using a 5% threshold we are admitting that we are willing to make the mistake of rejecting the null when it is in fact true 5% of the time.

Motivation 3 (newspapers)

Let p be the proportion of the population that is pro-gay marriage.
We want to investigate whether over 60% of Americans are pro-gay marriage.

H0: p ≤ 0.6 (the proportion is at most 60%) vs HA: p > 0.6 (the proportion is greater than 60%)

• SCENARIO 1. A sample of 500 people were interviewed. In the sample 58% said they were pro-gay marriage. Is there evidence to back the newspaper's claim?
• For this data set, 58% is consistent with the null being true. We cannot disprove the null given the data. Thus there is no evidence to suggest that the proportion of Americans who are pro-gay marriage is over 60%.
• On the other hand, we cannot discount this claim, since this is just a sample. It could be that over 60% of the population is pro-gay marriage and this is just not backed up by this sample (the difference could be due to sampling variation).
• SCENARIO 2. A random sample of 500 people are interviewed. In that sample 62% said they were pro-gay marriage. What do you think about the newspaper's claim based on this sample?
• In this situation the sample proportion is greater than 60%, but we need to ask ourselves: is this because over 60% of the population really are pro-gay marriage, or is the actual population proportion at most 60% and the 62% observed in the sample simply due to random variation?
  § Example: We know that the proportion of females in the world is 50%. But in any given sample there could be more or less than 50% females.
• Is 62% in the sample consistent with the null being true? To answer this we calculate the chance of obtaining a sample proportion of 62% or more when, in fact, the population proportion is 60%.
• This is the principle. In practice we are calculating a probability, and this probability depends on a few ingredients:
  § The size of the sample.
  § The variability in the population (as measured by the standard deviation).
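The slides do not carry out the calculation for the newspaper example, so the following is only a minimal sketch of how the two ingredients above (sample size and variability) could enter it. It assumes the usual normal approximation for a sample proportion, with the standard error computed at the null boundary p = 0.6; the function name `right_tail_p` is made up for illustration.

```python
import math
from statistics import NormalDist

P0 = 0.60   # population proportion at the boundary of the null (from the slides)
N = 500     # sample size (from the slides)

def right_tail_p(p_hat, p0=P0, n=N):
    """Chance of a sample proportion of at least p_hat when the true proportion is p0.

    Assumption (not in the slides): the sample proportion is approximately
    normal with standard error sqrt(p0 * (1 - p0) / n).
    """
    se = math.sqrt(p0 * (1 - p0) / n)       # variability of the sample proportion
    z = (p_hat - p0) / se                   # number of standard errors from 0.6
    return z, 1 - NormalDist().cdf(z)       # area to the RIGHT, since HA: p > 0.6

z62, p62 = right_tail_p(0.62)   # Scenario 2: 62% in the sample
z58, p58 = right_tail_p(0.58)   # Scenario 1: 58% in the sample

print(f"62%: z = {z62:.2f}, chance = {p62:.3f}")
print(f"58%: z = {z58:.2f}, chance = {p58:.3f}")
```

Under these assumptions a sample proportion of 62% is less than one standard error above 60%, so it would not be unusual if the null were true; a larger sample would shrink the standard error and change that picture.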
Further reading: http://onlinestatbook.com/2/logic_of_hypothesis_testing/intro.html

The underlying principle in a test

• In a hypothesis test we always calculate the probability of observing the data under the null being true.
• The underlying idea of a hypothesis test is that rare events are unlikely to happen. If this probability turns out to be small, it suggests the assumption (in this case the null hypothesis) is not true and that there is evidence the alternative is true instead.
• In most statistical tests we will encounter, the underlying assumption will be based on the mean.
• This may seem very simple, but it will allow us to test a wide range of useful hypotheses.
• Most calculations rely on the sample mean being approximately normally distributed. Therefore we always need to check this assumption, or else the probability we calculate will be incorrect.
• In the next few slides we will explain how to calculate these probabilities.

Purpose of Significance Tests

We have seen that the properties of the sampling distribution of x̄ help us estimate an interval of plausible values for the population mean µ.

• We can also rely on the properties of the sampling distribution to test hypotheses. A test is based on determining how plausible a particular claim is.
• Example: You are in charge of quality control in your food company. You randomly sample fourteen packs of cherry tomatoes, each labeled 227 grams. The average weight from your fourteen packs is 224g. Obviously, we cannot expect packs filled with whole tomatoes to all weigh exactly 227 grams.
• Is the somewhat smaller weight simply due to chance variation? Or is it evidence that the machine that sorts the cherry tomatoes into packages needs to be recalibrated?
• The null hypothesis is a very specific statement about parameter(s) of the population(s). It is labeled H0. This is the hypothesis we assess.
p  The alternative hypothesis is a more general statement about the parameter(s) that is exclusive of the null hypothesis. It is labeled Ha. We accept Ha only if we find H0 to be implausible. Weight of cherry tomato packs: H0 : µ = 227g (µ is equal to the value claimed by the produce company) Ha : µ ≠ 227g (µ is either larger or smaller than the value claimed) The tomato machine data p  Here is the actual data from the machine: p  224.1 222.2 226.4 224.8 218.7 231.2 224.8 226.7 217.7 227.6 212.5 236.1 226.6 219.9 This is the data, the sample mean is 224 (precisely 224.23) and sample standard deviation is 5.89. The basic prescription p  Much of what we do are calculations. p  However, in most data analysis, there will be no need to do the calculations. The software (such as Statcrunch) will give you some probabilities that you need to understand. p  However, understanding the calculations will help in understanding what these probabilities actually mean. p  After collecting the data,0the basic prescription is to make a z/t1 transform. That is: @X̄ A µ t = q  |{z} mean under the null s.e This will measure the number of standard errors the estimator is from the null hypothesis. The larger this value, the `less’ likely the null hypothesis. To find the probability of observing the data under the null we need to find the probability associated to the t-transform either by looking up the t-tables (or the z-tables if the population standard deviation is known). p  Having set-up the hypothesis we collect the data. We find that 14 boxes of tomatoes packed with the machine, gave the sample mean 224 grams. The sample standard deviation is 5.89 grams. p  Now our objective is to calculate the chance of getting a sample mean of 224 grams or lower under the null that the mean packing weight is the same as usual (227g). p  The standard error of the sample mean is 5.89/√14 =1.57. 
p  We assume that the sample size is large enough such that the sample mean is close to normally. If the null is true, we center the mean about 227g. Make a plot with 227g in the center. p  On this plot place 224g. To find out the chance of getting a sample mean of 224g or less when the mean is 227g we make a t-transform t= p  224 227 = 1.57 1.9 -1.9 tells us that 224 is -1.9 standard errors to the left of the mean 227 (if this were the true population mean). Looking up the area to the left of -1.9 (using Statcrunch) gives the probability 4%. DeFinition: The P-‐value (for two sided test) How unusual is this data, assuming it is properly calibrated (null is true)? We calculated that the sample mean is t = -1.9 standard errors from the mean under the null. The area to the left of -1.9 is 4%. Samples that are properly calibrated and are at least as unusual as this have t-value that is either greater than 1.9 or less than -1.9. The chance of this is the area to the right of 1.9 or area to the left of -1.9, which is 2×4 = 8%. Definition We want to quantify the proportion of random samples that are at least as unusual as our actual result, if the null hypothesis were true. This quantity is called the p-value. The p-value (for a two-sided test, which this is) is 2×the smallest area. Tomato Example: p-value = P (|t| 1.9) = P (t  1.9) + P (t 1.9) = 2 ⇥ P (t 1.9) = 8% Further reading: http://onlinestatbook.com/2/tests_of_means/single_mean.html Deciding the conclusion with α A very small P-value indicates that our results probably did not p  occur when the null hypothesis is true, and therefore H0 is implausible. It should be rejected. In this case we say the evidence is significant. p  The smaller the P-value the stronger the evidence against H0. p  The significance level α is the largest P-value for which we are willing to reject the null hypothesis. The value of α is decided before conducting the test. p  If the P-value is equal to or less than α then we reject H0. 
This is when we accept Ha.
• If the P-value is greater than α then we fail to reject H0. Whatever evidence there is, it is not sufficient to accept Ha.
• Typically we set α = 5%.

Does the packaging machine need recalibration?

Recall our hypotheses.
H0: µ = 227g (µ is equal to the value claimed by the produce company and the machine does not need recalibration)
Ha: µ ≠ 227g (µ is either larger than 227g or smaller than 227g, in which case the machine needs recalibration)

The produce company traditionally uses α = 5% for their quality control. That is what you choose to do here. Since the P-value is 8%, it is larger than α and therefore H0 is not rejected, and the decision is to not recalibrate the machine.

* If α had been chosen as 10%, then the P-value would be significant and H0 would be rejected. The decision would then be to recalibrate the machine. No matter what we decide, there is always the possibility that the conclusion is incorrect.

The point/boundary of decision when α = 5%

• Recall that at α = 5% we could not reject the null, because the p-value was 8%. In general, if the p-value is greater than 5% we cannot reject the null, and if the p-value is less than 5% we can reject the null.
• A p-value of exactly 5% is the boundary of the decision. We recall that this corresponds to 2.16 standard errors from the mean under the null (in this case 227g).
• Therefore, if the sample mean is in the interval (this is not a confidence interval)

  $\left[227 - 2.16 \times \frac{5.89}{\sqrt{14}},\; 227 + 2.16 \times \frac{5.89}{\sqrt{14}}\right] = [223.6, 230.4]$

  then we cannot reject the null at the 5% level, because the p-value for any number inside this interval will be larger than 5%.
• On the other hand, if the sample mean is outside this interval, then the p-value will be less than 5%. Thus, when the sample mean is outside this interval we can reject the null hypothesis (that the mean is 227g) at the 5% level.

Because the standard deviation s = 5.89 was estimated from the data, we need to use the t-tables to obtain the probability. We used the t-distribution with 13 degrees of freedom rather than the normal.
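The tomato calculation can be retraced in a few lines. This is a sketch rather than the course's method: it uses the rounded summary values quoted on the slides (sample mean 224, s = 5.89, n = 14), takes the critical value 2.16 from the text, and replaces the t-table lookup with a small numerical integration of the t density; the helper names `t_pdf` and `t_cdf` are made up for illustration.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x): Simpson's rule on [0, |x|], then symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                      # area between 0 and |x|
    return 0.5 + area if x > 0 else 0.5 - area

# Summary values as quoted on the slides
MU0, XBAR, S, N = 227.0, 224.0, 5.89, 14
T_CRIT = 2.16                                 # 2.5% point of t with 13 df (from the text)

se = S / math.sqrt(N)                         # standard error of the sample mean
t_stat = (XBAR - MU0) / se                    # standard errors away from 227
left_area = t_cdf(t_stat, N - 1)              # area to the left of t
p_two_sided = 2 * min(left_area, 1 - left_area)
lower, upper = MU0 - T_CRIT * se, MU0 + T_CRIT * se

print(f"s.e. = {se:.2f}, t = {t_stat:.2f}")
print(f"left area = {left_area:.3f}, two-sided p-value = {p_two_sided:.3f}")
print(f"non-rejection interval at 5%: [{lower:.1f}, {upper:.1f}]")
```

With the rounded inputs this gives a t-value of about -1.9 and a two-sided p-value of about 8%. Feeding in the unrounded sample mean would shift these figures slightly, which is one reason to treat hand calculations with rounded values as approximate.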
The area to the left of -1.9 can be calculated using Statcrunch (go to Stat -> Calculators -> T, select df=13 and place -1.9 in the equation). You should get 0.04. The p-value is 2×4% = 8%.

Alternatively, we can deduce bounds for the p-value using the tables. With 13 degrees of freedom, 1.9 lies between the tabulated values 1.771 (tail area .05) and 2.160 (tail area .025). Thus the P-value is between 2×.025 = .05 and 2×.05 = .10. This assesses the "believability" of the null hypothesis, given the evidence of the random sample.

More Examples

• Let us return to the tomato problem. We want to see whether the tomato weighing feature is functioning in 4 different machines. So samples of size 14 are collected from the 4 machines.
• Sample 1 is from machine 1, etc. The black dot denotes the sample mean. Based on this plot and the sample size, which machines do you think will need readjustment? We will make this analysis precise in the next few slides.

Tomato boxes using the t-tables

• Below we compare the t-transforms with the t-value at the 2.5% point. From this we can deduce whether we can reject the null or not.

| | Null | X̄ | s | s.e. = s/√14 | t-transform | t13(2.5) | 227 ± s.e. × t13(2.5) | p-value | Result |
|---|---|---|---|---|---|---|---|---|---|
| Scenario 1 | µ = 227 | 227.1 | 0.597 | 0.160 | (227.1 − 227)/0.16 = 0.63 | 2.16 | [226.65, 227.35] | 54% | Cannot Reject |
| Scenario 2 | µ = 227 | 225.8 | 0.424 | 0.113 | (225.8 − 227)/0.113 = −10.6 | −2.16 | [226.76, 227.24] | 1.4 × 10⁻⁵ % | Reject |
| Scenario 3 | µ = 227 | 227.7 | 2.43 | 0.649 | (227.7 − 227)/0.649 = 1.07 | 2.16 | [225.6, 228.4] | 30.4% | Cannot Reject |
| Scenario 4 | µ = 227 | 225.4 | 2.65 | 0.708 | (225.4 − 227)/0.708 = −2.26 | −2.16 | [225.5, 228.5] | 4.16% | Reject (just about) |

Interpretation of the results

• We see that for sample 1, the p-value is 54%, which means there is a large chance of observing the sample mean 227.1 when the population mean is 227. The spread in the data is small and concentrated about 227. There is no evidence that machine 1 is not working correctly.
• In sample 2, the p-value is very, very small. Note: the spread in the data is small and concentrated quite far from 227.
This strongly suggests that the machine needs to be readjusted, as it seems highly unlikely to observe the average 225.8 when the mean of the machine is 227.
• In sample 3, the p-value is 30.4%. This is large, and means that there is a good chance of observing the average 227.7 when the population mean is 227. The data is highly variable and, with such a small sample size, it is not easy to say whether the machine needs readjustment or not.
• In sample 4, the average is 225.4 and the p-value is 4.16%. This means there is a relatively small chance of observing the mean 225.4 when the true mean is 227. It is possible that the machine needs readjusting.

Using a larger sample size

• It is costly to readjust the machine, so we need more compelling evidence that the machine's mean is different from 227g. In this case we need to increase the sample size. We now consider the average over 100 boxes for each of the machines. A plot of the results is below.

Larger sample sizes

• Given that the sample size has increased, we now calculate the likelihood of observing these averages.

| | Null | X̄ | s | s.e. = s/√100 | t-transform | t99(2.5) | 227 ± s.e. × t99(2.5) | p-value | Result |
|---|---|---|---|---|---|---|---|---|---|
| Scenario 6 | µ = 227 | 226.94 | 0.499 | 0.049 | (226.94 − 227)/0.049 = −1.22 | −1.984 | [226.9, 227.1] | 22% | Cannot Reject |
| Scenario 7 | µ = 227 | 225.9 | 0.506 | 0.0506 | (225.9 − 227)/0.05 = −21.7 | −1.984 | [226.9, 227.1] | ≈ 0 | Reject |
| Scenario 8 | µ = 227 | 227.2 | 3.00 | 0.3 | (227.2 − 227)/0.3 = 0.66 | 1.984 | [226.4, 227.6] | 51% | Cannot Reject |
| Scenario 9 | µ = 227 | 226.1 | 2.76 | 0.276 | (226.1 − 227)/0.276 = −3.33 | −1.984 | [226.45, 227.55] | 0.12% | Reject |

When comparing these results with the previous results, we see that the standard errors are smaller. This implies that the non-rejection interval is narrower and the t-values can be far larger. Observe that in the cases where we reject the null, the p-values are far smaller than when the sample size was 14.
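The effect of the sample size on the standard error and on the non-rejection interval can be checked with simple arithmetic. The sketch below takes the standard deviations of Scenario 4 (n = 14) and Scenario 9 (n = 100) and the critical values 2.16 and 1.984 directly from the two tables; the function name `non_rejection_interval` is made up for illustration.

```python
import math

MU0 = 227.0
# 2.5% critical values quoted in the tables: t13 for n = 14, t99 for n = 100
T_CRIT = {14: 2.16, 100: 1.984}

def non_rejection_interval(s, n, mu0=MU0):
    """Return (s.e., lower, upper) of the 5% non-rejection interval around mu0."""
    se = s / math.sqrt(n)                 # standard error shrinks like 1/sqrt(n)
    half_width = T_CRIT[n] * se
    return se, mu0 - half_width, mu0 + half_width

se14, lo14, hi14 = non_rejection_interval(s=2.65, n=14)    # Scenario 4
se100, lo100, hi100 = non_rejection_interval(s=2.76, n=100)  # Scenario 9

print(f"n = 14 : s.e. = {se14:.3f}, interval = [{lo14:.2f}, {hi14:.2f}]")
print(f"n = 100: s.e. = {se100:.3f}, interval = [{lo100:.2f}, {hi100:.2f}]")
```

Although the two machines have similar spread, the interval for n = 100 is less than half as wide, so a sample mean the same distance from 227 yields a much smaller p-value.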
Calculation practice: Annual coffee shop sales

The marketing firm that studied annual coffee shop sales had a statistical model that had predicted the yearly average to be $2.33 million. Now that they have data, they want to determine if this prediction was accurate. They choose to use α = 5%.

• Hypotheses: H0: µ = 2.33 vs. Ha: µ ≠ 2.33 (in millions of dollars).
• The sample data are: sample mean = 2.67, s = 1.03, n = 41, df = 40. This gives the standard error = 1.03/√41 = 0.16 and the t-value = (2.67 – 2.33)/0.16 = 2.125.
• From the t-table, using 40 df, the area to the right of 2.125 is less than 2.5%. Using a computer we obtain the probability 1.6%. So the P-value is 2 × 1.6% = 3.2%.
• Since the P-value = 3.2% is less than α = 5%, the firm rejects H0. (The evidence is weak for H0 and strong for Ha.)
• The firm concludes "At the 5% significance level, the evidence indicates that the model's prediction was inaccurate."

Review of the Tests of Significance

1. State the null hypothesis H0 and the alternative hypothesis Ha.
2. Calculate the value of the test statistic (such as a t-statistic). This is a measure of how much the data and H0 differ from each other.
3. Determine the P-value for the observed data. This is the chance, if H0 is true, of observing a test statistic at least as extreme/unusual as the one obtained.
4. Compare the P-value to the significance level α and decide whether or not there is sufficient evidence (i.e., P-value ≤ α) to reject the null hypothesis.
5. State your conclusion in terms your audience will understand, citing the significance level used to obtain it.

Comments on the decision rule

• The objective of a test is to decide between two competing hypotheses on the basis of their plausibility.
• The p-value is the probability of observing data at least as unusual as ours under the assumption that the null is true.
• If the p-value is less than the significance level (often set at 5%), the decision is to reject the null and go for the alternative instead.
p  If the p-value is greater than 5% than the data is consistent with the null being true and we cannot reject the null. p  The point is there is a chance we made the wrong decision. We could have wrongly rejected the null when actually the null is true. p  The chance of this happening is the significance level. In other words, if we set the significance level at 5% and our p-value is less than 5% there is 5% chance we have made the wrong decision. p  The value at which we set the significance level determines how willing we are to wrongly reject the null hypothesis. p  Examples: p  p  p  p  Suppose we are in a tomato packing plant. Our aim is to ensure that the mean weight of a tomato box is 227g. Every few hours we randomly sample 14 boxes of tomatoes and do a hypothesis test. Each test is done at the 5% level. We do the test 100 times, if the null hypothesis is true, the on average we would falsely reject the null 5 times. Each time we falsely reject the null, it is called a type I error or in medical terms a false positive. Suppose we reduce the significance level to 1%, in this case if the null were true we would falsely reject the null 1 time out of a hundred. We will show in Chapter 8 that by increasing the significance level (from, say 5% to 10%) we increase the number of false positives, but we are more likely to detect the alternative, if it is true. Decreasing the significance level will have the opposite effect. The SigniFicance level p  How to choose the significance level? p  p  p  p  p  p  p  There is a trade off between not wanting to falsely reject the null but wanting to detect the alternative. The lower the significance level, the less likely we are not falsely reject the null, but this makes detecting the alternative much harder! Example: Consider the court case H0: Innocent HA: Guilty. The p-value is the probability of observing the evidence given the null is actually true. 
If we set the significance level at 5% and say a person is guilty if the p-value is less than 5%, this means that we would put in prison 5% of all innocent people who were put on trial! For a democracy this is just too much! Therefore, in this situation, we need to place the significance level at a much lower value, to avoid throwing too many innocent people in jail. If the significance level is set to zero, this means that no one who is innocent is put into jail. However, it also means that all guilty people go free. We need to tread a line between the two.

A common p-value misconception

• A very common misconception about a p-value is that the p-value is the probability of the null being true (and that 1 − p-value is the probability of the alternative being true). This is not true.
• A p-value is simply the chance of observing what we do, or something more extreme, under the null being true.
• This misunderstanding about p-values can have severe consequences in criminal trials.
• For example, a juror in a court may hear something like `The DNA on the weapon matches the defendant; there is a one in a million chance this could happen by random chance'. A misinformed juror may interpret this as `there is only a 1 in a million chance that he is innocent'. This is an incorrect understanding of what the probability means. Typically, the 1 in a million means that approximately 1 in a million people have the observed DNA. This no longer appears that improbable: there are 7 billion people in the world, so on average 7000 of them will match this DNA!

One-sided and two-sided tests

• A two-sided test of the population mean has these null and alternative hypotheses:
  H0: µ = [a specific number µ0]
  Ha: µ ≠ [a specific number µ0]
  The tomato packaging and coffee sales examples were two-sided.
• A one-sided test of a population mean has one of these pairs of null and alternative hypotheses:
  H0: µ ≥ [a specific number µ0], Ha: µ < [a specific number µ0]
  OR
  H0: µ ≤ [a specific number µ0], Ha: µ > [a specific number µ0]

Does moderate consumption of red wine increase polyphenol levels? H0: µ ≤ 0 against HA: µ > 0. This is a one-sided test.

How to choose?

The choice of a one-sided versus a two-sided test depends on our purpose for doing the investigation in the first place, as determined before we perform the test of statistical significance. When appropriate, one-sided tests are preferable.

A health advocacy group suspects that a cigarette manufacturer sells cigarettes with a nicotine content higher than what they advertise, in order to keep consumers addicted to their products and thus maintain revenues. Here, the health advocacy group wants to determine whether the mean nicotine content of a brand of cigarettes is greater than the advertised value of 1.4 mg. But they will decide and publicize this only if the evidence is sufficiently strong to rule out the advertised value. Thus, this is a one-sided test:

H0: µ ≤ 1.4 mg
Ha: µ > 1.4 mg

It is important to identify both hypotheses before obtaining the data, or else the idea of "significance" becomes meaningless.

Examples: What is the hypothesis?

• Question: A 2008 study reported that 88% of students owned a cell phone. There has been a recent health scare on cell phone use. You plan to take an SRS of students to see if the percentage has decreased.
  H0: p ≥ 88% against HA: p < 88% (here the parameter is a proportion p rather than a mean µ).
• Question: It is known that a freshman biology class has mean score 75%. A professor thinks that students who attend early morning classes have a higher mean score. Her early morning class this year can be considered as a sample of all students who take an early morning class. So she compares their average score to the mean score of 75%.
  H0: µ ≤ 75% against HA: µ > 75%.
More examples p  Question Experiments on learning in animals sometimes measure how long it takes a mouse to find its way through a maze. The mean time is 20 seconds for one particular maze. A researcher thinks that playing loud music will cause the mice to complete the maze slower. She measures how long each of 12 mice take to get through the maze with the loud music stimuli. q  q  H0 : µ ≤ 20 against HA : µ > 20. Question The price of gasoline has changed, previously the mean yearly mileage of a vehicle was 4000 miles. I want to see whether the mean yearly mileage has changed after the price change. q  H0 : µ = 4000 against HA : µ ≠ 4000. Calculation of p-‐values for one-‐sided tests p  The calculation of p-values for one-sided tests is almost the same was the calculation of the p-value for two-sided test. p  p  p  p  p  As in two-sided tests we make the same z or t-transform. Once the z or t-transform has been made, now we have to take care. We need to look in what direction the alternative arrow is pointing in. If the alternative arrow is pointing to the right, eg. HA : µ > 20, then the pvalue is the area to the right of the z or t-transform. Unlike the two-sided test case we do not double the probability. If the alternative arrow is pointing to the left, eg. HA : µ > 20, then the pvalue is the area to the left of the z or t-transform. Again do not double the probability. p  Note, that for one-sided tests, the p-value can be larger than 50%, when the sample mean is on the `other’ side of the alternative. For example, if HA : µ > 20, and we have a sample mean = 19, then the p-value has to be greater than 50%. From a plot you can easily see why this is true. p  In the next few slides we will get some practice in these ideas. 
Recap: P-values in one-sided and two-sided tests

• One-sided test for Ha: µ > µ0, P-value = P(T ≥ t) (if t < 0, do not reject H0)
• One-sided test for Ha: µ < µ0, P-value = P(T ≤ t) (if t > 0, do not reject H0)
• Two-sided test for Ha: µ ≠ µ0, P-value = 2×P(T ≥ |t|)

To calculate the P-value for a two-sided test, use symmetry. Find the P-value for a one-sided test and double it.

One-sided tests: Red Wine and Polyphenols (4 scenarios)

• Recall our aim was to see whether the consumption of red wine increased polyphenol levels. We state this as:

  H0: µ ≤ 0 (mean levels of polyphenols have stayed the same or reduced) vs HA: µ > 0 (mean levels of polyphenols have risen)

We obtain the results for the four different scenarios considered at the start of this chapter. The alternative points to the RIGHT, so we need the area to the RIGHT.

Data:
Scenario 1: -0.60, -1.05, -2.09, -1.23, 0.71, -0.53, 0.33, -0.48, -1.42
Scenario 2: 0.06, -0.36, 0.98, 0.82, -0.25, 2.49, -1.34, 1.16, 1.53
Scenario 3: 8.45, 10.18, 10.98, 10.35, 10.75, 8.98, 8.84, 10.38, 9.79
Scenario 4: -0.43, -8.35, -8.31, 26.11, 4.32, 25.02, 9.40, 11.54, 0.71

| | Null | X̄ | s | s.e. = s/√9 | t-transform | t8(5) | (−∞, 0 + s.e. × t8(5)] | p-value | Result |
|---|---|---|---|---|---|---|---|---|---|
| Scenario 1 | µ ≤ 0 | −0.7 | 0.87 | 0.29 | (−0.7 − 0)/0.29 = −2.41 | 1.86 | (−∞, 0.53] | 97.9% | Cannot Reject |
| Scenario 2 | µ ≤ 0 | 0.56 | 1.15 | 0.383 | (0.56 − 0)/0.383 = 1.46 | 1.86 | (−∞, 0.72] | 9.12% | Cannot Reject |
| Scenario 3 | µ ≤ 0 | 9.86 | 0.90 | 0.3 | (9.86 − 0)/0.3 = 32.8 | 1.86 | (−∞, 0.56] | ≈ 0% | Reject |
| Scenario 4 | µ ≤ 0 | 6.66 | 12.6 | 4.2 | (6.66 − 0)/4.2 = 1.58 | 1.86 | (−∞, 7.8] | 7.6% | Cannot Reject |

The t and p-values for the red wine problem

[Figure: plots of the t-distribution showing the p-value for each of the four scenarios.]

Remember we need to calculate the area to the RIGHT of each t-value, since the alternative hypothesis is pointing to the RIGHT (HA: µ > 0).

The 4 red wine examples in Statcrunch

• Compare the results done by hand with those done in Statcrunch. Match the standard errors, t-stats and P-values with the results done by hand.
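As a cross-check on the hand calculations, the four scenarios can be run from the raw data. This is a sketch in Python rather than Statcrunch; the t-table lookup is replaced by a numerical integration of the t density, and the helper names `t_pdf` and `t_cdf` are made up for illustration. Because it works from unrounded means and standard deviations, the t-values can differ slightly from the hand-rounded ones in the table.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x): Simpson's rule on [0, |x|], then symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area

# Differences in polyphenol levels, copied from the slides
scenarios = {
    "Scenario 1": [-0.60, -1.05, -2.09, -1.23, 0.71, -0.53, 0.33, -0.48, -1.42],
    "Scenario 2": [0.06, -0.36, 0.98, 0.82, -0.25, 2.49, -1.34, 1.16, 1.53],
    "Scenario 3": [8.45, 10.18, 10.98, 10.35, 10.75, 8.98, 8.84, 10.38, 9.79],
    "Scenario 4": [-0.43, -8.35, -8.31, 26.11, 4.32, 25.02, 9.40, 11.54, 0.71],
}
ALPHA = 0.05
results = {}
for name, diffs in scenarios.items():
    n = len(diffs)
    xbar = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)   # sample s.d. over sqrt(n)
    t_stat = (xbar - 0) / se                      # null boundary is mu = 0
    # HA: mu > 0 points RIGHT, so take the area to the right of t
    # (clamped at 0 to absorb tiny numerical integration error)
    p = max(0.0, 1 - t_cdf(t_stat, n - 1))
    results[name] = (xbar, se, t_stat, p, p <= ALPHA)
    verdict = "Reject" if p <= ALPHA else "Cannot reject"
    print(f"{name}: mean={xbar:.2f} s.e.={se:.3f} t={t_stat:.2f} p={p:.3f} {verdict}")
```

Only Scenario 3 leads to rejection at the 5% level, in agreement with the table; Scenario 4 has a large sample mean but also a large standard error, so its t-value stays below the 1.86 cut-off.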
Deducing one-sided results from two-sided

We recall that for a given data set and population mean we can do three different tests. However, the results of each test are connected.
For example, suppose we want to test the hypothesis that red wine decreases polyphenol levels. Then our hypothesis of interest is H0: µ ≥ 0 against HA: µ < 0.
We are given the output from the first data set. This is the result of the test H0: µ ≤ 0 against HA: µ > 0. The p-value for this test is 98%, and there is no evidence to reject the null (the sample mean is negative).
However, if we test H0: µ ≥ 0 against HA: µ < 0, the p-value is the area to the LEFT of -2.45, which is 100% - 98% = 2%. Therefore, there is evidence to suggest that µ < 0, hence we can reject the null of this hypothesis.
If we test H0: µ = 0 against HA: µ ≠ 0, the p-value is 4% and there is evidence to suggest the mean is not zero.

Example: Gestational diabetes

Let us return to the example of testing for gestational diabetes.
We will use the data collected to test for gestational diabetes. We know that a patient has gestational diabetes if the mean glucose level of the patient is over 140. This means we are testing H0: µ ≤ 140 against HA: µ > 140.
Question: A patient goes to the doctor. We do not know if she has gestational diabetes (µ is unknown). The glucose level in her blood samples is assumed to be normally distributed with σ = 4. After taking 4 blood samples her sample mean is 145. Is there evidence that she has gestational diabetes?
Answer: We want to see whether she has gestational diabetes; this means discounting the possibility that she does not have gestational diabetes. We want to test H0: µ ≤ 140 against the alternative HA: µ > 140.
To do this we need to know the variability in the sample mean. This is quantified by the standard error = 4/√4 = 2.
Next we have to calculate how far her sample mean is from the mean if she were healthy: z-transform = (145 - 140)/2 = 2.5 (we call it a z-transform rather than a t-transform because we know the standard deviation).

Calculation Practice (cont.)

Since the alternative is pointing to the right, we need to calculate the probability to the right of 2.5. From the z-tables this is 0.6%.
0.6% is quite small. It says the chance of getting a sample mean of 145 or higher, when the patient does not have gestational diabetes, is 6 in 1000.
As this is quite a small chance and is below the standard α = 5% threshold, we reject the null and conclude that the patient has gestational diabetes. We therefore refer her for more tests. However, there is always a chance we are making the wrong decision, since there is a 6 in 1000 chance of observing this data when she does NOT have gestational diabetes.

Example: Low Potassium

Hypokalemia is diagnosed when the blood potassium level is below 3.5 mEq/dl. The potassium in a blood sample varies from sample to sample and follows a normal distribution with unknown mean, but the standard deviation is known to be 0.2. We only `diagnose' low potassium when we discount the possibility that the potassium levels are normal.
Question: State the hypothesis of interest.
Answer: H0: µ ≥ 3.5 against HA: µ < 3.5.
Question: A patient has 9 blood samples taken and his sample mean is 3.4. Is there evidence to suggest low potassium (use the 5% significance level)?
Answer: The alternative is pointing LEFT so the p-value is the area to the left of
z = (3.4 - 3.5)/(0.2/√9) = -0.1/0.067 = -1.5 (s.e. = 0.2/√9 ≈ 0.067).
Looking up the z-tables (remember the standard deviation is known) gives the p-value 6.68%. As this is greater than 5% we cannot reject the null. There is not enough evidence that he has low potassium.
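The two worked examples above (gestational diabetes and low potassium) follow the same recipe: standard error, z-transform, then a tail area in the direction of the alternative. A minimal Python sketch of that recipe follows, assuming the population standard deviation is known as in both examples; the function name `one_sided_z_test` is hypothetical and `statistics.NormalDist` stands in for the z-tables.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_z_test(sample_mean, mu0, sigma, n, alternative):
    """Return (z-transform, p-value) for a one-sided test with known sigma."""
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    if alternative == "right":    # HA: mu > mu0, area to the right of z
        p = 1 - NormalDist().cdf(z)
    elif alternative == "left":   # HA: mu < mu0, area to the left of z
        p = NormalDist().cdf(z)
    else:
        raise ValueError("alternative must be 'right' or 'left'")
    return z, p

# Gestational diabetes: H0: mu <= 140 against HA: mu > 140
z_gd, p_gd = one_sided_z_test(145, 140, sigma=4, n=4, alternative="right")
print(f"diabetes:  z = {z_gd:.2f}, p-value = {p_gd:.4f}, reject at 5%: {p_gd < 0.05}")

# Low potassium: H0: mu >= 3.5 against HA: mu < 3.5
z_k, p_k = one_sided_z_test(3.4, 3.5, sigma=0.2, n=9, alternative="left")
print(f"potassium: z = {z_k:.2f}, p-value = {p_k:.4f}, reject at 5%: {p_k < 0.05}")
```

The p-values agree with the z-table values quoted above (about 0.6% and 6.68%), so the first null is rejected at the 5% level and the second is not.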
The tests in StatCrunch

By giving StatCrunch the sample mean, standard deviation, sample size and the hypothesis under investigation, StatCrunch will give us the p-value, and it is our job to understand what it means.
To do this, go to Stat -> T-statistics (if the standard deviation is estimated from the data, else z-statistics) -> One Sample -> With summary. In the box you input the sample mean, standard deviation and sample size. In the next box choose the null of interest and also the alternative of interest (whether it is a two-sided test or one-sided test; more of this later).
You should then get output with the p-values.

Lab practice

Load the calf data into StatCrunch. We want to draw inference about the mean weight of a newborn calf based on the sample mean of 44 calves.
We first make a histogram of the data, to see if there are any major deviations from normality.
The distribution of weights at birth does not have an obvious skew or thick tail. This means that the distribution of the sample mean based on a sample of 44 will be very close to normal. So we can rest assured that using the t-distribution (since the standard deviation is unknown) will be reliable.
Now we construct a 95% confidence interval for the mean. We can do this by going to Stat -> T-statistics -> One sample -> with data, putting Weight W0 into the right box, then selecting Confidence interval and Calculate. This will give you a 95% confidence interval using the t-distribution:
[90.85, 95.58]
This means that with 95% confidence the mean weight of newborn calves should lie in this interval.
We now want to see whether there is evidence to suggest the mean weight of calves is greater than 90 pounds, i.e. H0: µ ≤ 90 against HA: µ > 90. We can already see from the confidence interval, which lies entirely above 90, that the null seems unlikely. Later in this chapter we will see how tests and confidence intervals are related.
We can also deduce the p-value in StatCrunch.
Again, go to Stat -> T-statistics -> One sample -> with data, put Weight W0 into the right box, then press next. Select Hypothesis Test. In the box for the null, enter mean = 90, and choose > as the alternative. Then press calculate. It will calculate the p-value using the t-distribution with 43 degrees of freedom.
You get the p-value 0.44%. This means that at the 1% level we can reject the null. Looking at the 98% CI we see that 90 does not lie in this interval; this fits with the p-value being less than 1% (we cover this in later slides).

More lab practice

212 earthquakes of magnitude 6.0 or higher were observed in the period 09/01/10 to 08/31/11. We are interested in the depth of these earthquakes and whether the average depth (µ) exceeds 50 km.
The summary data are [Stat-Summary Stats-Columns]
The data are quite highly skewed. They may even be bimodal. However, with a sample size of 212 it is reasonable to suppose the sample mean is close to normal (we can also check using the app).

Mean earthquake depth (cont.)

We will get a 95% confidence interval for µ.
First get t* = 1.971 (for df = 211, area to left = .975). [Stat-Calculators-T]
Next, compute the interval:
x̄ ± t* × s/√n = 65.73 ± 1.971 × 125.65/√212 = 65.73 ± 17.01 = (48.72, 82.74).
The average depth is between 48.72 km and 82.74 km, with 95% confidence.
Or, from StatCrunch, [Stat-T Statistics-from data]

Mean earthquake depth (cont.)

We also will use α = 5% for a hypothesis test of H0: µ ≤ 50 versus Ha: µ > 50.
First compute the t-statistic:
t = (x̄ − 50)/(s/√n) = (65.73 − 50)/(125.65/√212) = 1.823.
Next, look up the P-value = 0.035. [Stat-Calculators-T]
P-value = 3.5% < α = 5%, so we reject H0 and conclude the average depth is greater than 50 km.
Or, from StatCrunch, [Stat-Summary Stats-Columns]

Calculation practice: Sweetening colas

A cola manufacturer wants to test how much the sweetness of a new cola drink is affected by storage.
The sweetness loss due to storage will be evaluated and scored by 10 professional tasters (by comparing the sweetness before and after storage).
We only want to test if storage results in a mean loss of sweetness. That is, if µ is the mean change in the tasters' scores, we are only concerned with whether the evidence will show that it is negative. So the hypotheses are:
H0: µ ≥ 0 vs. Ha: µ < 0
Note that these are determined prior to obtaining the data. This choice will affect how the P-value is calculated.
The next step is to obtain the data and compute the t-statistic.

Taster             1     2     3     4     5     6     7     8     9     10
Sweetness change   −2.0  −0.4  −0.7  −2.0  0.4   −2.2  1.3   −1.2  −1.1  −2.3

sample average = −1.02
standard deviation = 1.196
degrees of freedom = 10 − 1 = 9
t = (x̄ − µ0)/(s/√n) = (−1.02 − 0.00)/(1.196/√10) = −2.697.
The large, negative t indicates substantial evidence in favor of the alternative hypothesis.

Sweetening colas (continued)

Is there sufficient evidence that storage results in sweetness loss for the new cola recipe at the 0.05 level of significance (α = 5%)?
H0: µ ≥ 0 versus Ha: µ < 0 (one-sided test)
We have t = −2.70 with 9 df. Since the test is one-sided, only values less than −2.70 are more in favor of Ha than our results (and thus more relevant). So P-value = area to the left of −2.70 = the area to the right of 2.70.
From the t-table: 2.398 < 2.70 < 2.821, thus 0.02 > P-value > 0.01.
Since P-value < α = .05, the result is significant and H0 is rejected. There is a loss of sweetness, on average, following storage.
The t-score associated with probability 0.05 is the critical value tα = 1.833. This represents the smallest value (in magnitude) for which the null hypothesis would be rejected at significance level α.

Two-sided tests and confidence intervals

There is a close connection between confidence intervals and two-sided tests. Let us return to the one-bedroom apartment in Dallas example.
10 apartments are randomly sampled.
The sample mean and the sample standard deviation based on this sample are 980 dollars and 250 dollars (both are estimators based on a sample of size ten). The 95% confidence interval for the mean is [980 ± 2.262×79] = [801, 1159].
Suppose we want to know whether the price of apartments has changed since last year, when the mean price was 850 dollars.
Based on this interval we see that 850 dollars is contained in the interval. This means the mean could be 850 dollars. Therefore, given the sample, it is unclear whether the mean price of apartments has changed since last year or not.
We can rewrite the above as a statistical test H0: µ = 850 against HA: µ ≠ 850. The t-transform is t = (980 − 850)/79 = 1.64. Looking at the t-distribution, we see that 1.64 < 2.262 (this is the t-value corresponding to 9 df at 2.5%). Therefore, the p-value is greater than 5%. Thus we cannot reject the null at the 5% level.
Further reading: http://onlinestatbook.com/2/logic_of_hypothesis_testing/sign_conf.html

Summarizing these two observations we see that:
850 lies inside the 95% confidence interval [801, 1159].
We are unable to reject the null at the 5% level.
If the mean under the null lies in the 95% confidence interval, then this implies the corresponding p-value will be greater than 5%.
On the other hand, if the mean under the null does not lie in the 95% confidence interval its p-value will be less than 5%.
This is easily seen with an illustration (see later slides).
If 850 is in an interval centered about 980 (where each side has length 178.7), then 980 must be in the interval centered about 850 with sides of length 178.7. A few slides earlier we showed that this interval [850 ± 2.262×79] = [671, 1028] corresponded to points where we make a decision to reject the null or not at the 5% level.
In general, if the mean under the null lies in a (1−α)×100% confidence interval, then the p-value for a two-sided test will be greater than α.
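The link between the 95% confidence interval and the two-sided test at the 5% level can be checked numerically for the apartment example. The sketch below is an illustration under stated assumptions: the critical value 2.262 (9 df, 2.5% in each tail) is taken from the slide rather than computed, the helper name `ci_and_test` is made up, and the second null value of 800 dollars is a hypothetical number added only to show the opposite case.

```python
from math import sqrt

x_bar, s, n = 980, 250, 10   # sample mean, sample standard deviation, sample size
t_star = 2.262               # t-value for 9 df at 2.5%, as quoted above

standard_error = s / sqrt(n)  # about 79
ci = (x_bar - t_star * standard_error, x_bar + t_star * standard_error)

def ci_and_test(mu0):
    """Return (mu0 is inside the 95% CI, two-sided test cannot reject at 5%)."""
    t = (x_bar - mu0) / standard_error
    inside = ci[0] <= mu0 <= ci[1]
    cannot_reject = abs(t) <= t_star
    return inside, cannot_reject

print(f"95% CI: [{ci[0]:.0f}, {ci[1]:.0f}]")
print("mu0 = 850:", ci_and_test(850))  # last year's mean price
print("mu0 = 800:", ci_and_test(800))  # hypothetical value just outside the interval
```

For any null value the two flags agree: the mean under the null is inside the interval exactly when |t| does not exceed the critical value, which is the statement that the two-sided p-value is greater than 5%.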
Confidence intervals and one-sided tests

Consider the polyphenol and red wine example from Chapter 6. 15 randomly sampled men were asked to drink red wine every day for two weeks. Their change in polyphenol levels was measured: 0.7, 3.5, 4.0, 4.9, 5.5, 7.0, 7.4, 8.1, 8.4, 3.2, 0.8, 4.3, -0.2, -0.6, 7.5. The average change is 4.3 and the sample standard deviation is 3.06.

Review: Two-sided tests and confidence intervals
The 95% confidence interval for the change in polyphenol levels is [2.6, 5.99]. This means that if I am testing the hypothesis H0: µ = 0 against the alternative HA: µ ≠ 0, since 0 is not in the interval the p-value is less than 100% − 95% = 5%.
The 99% confidence interval for the change in polyphenol levels is [1.94, 6.66]. This means that if I am testing the hypothesis H0: µ = 0 against the alternative HA: µ ≠ 0, since 0 is not in the interval the p-value is less than 100% − 99% = 1%.

One-sided test (pointing RIGHT)
Suppose we are testing that polyphenol levels increase. This means testing the hypothesis H0: µ ≤ 0 against the alternative HA: µ > 0. The p-value is the area to the right of 4.3 (see that the alternative is pointing to the right). Since from above we have deduced that in the two-sided test the p-value is less than 5%, for the one-sided test the p-value is less than 2.5%.
Why? Recall the p-value for two-sided tests is the smallest area to the left/right of the t-transform times 2. In this case it is the area to the right of 4.3 times 2. For the two-sided test we have deduced that the p-value is less than 5%; this implies that the area to the RIGHT of 4.3 is less than 5/2 = 2.5%. The p-value for the one-sided test pointing to the RIGHT is the area to the right of 4.3. We have just shown that the area to the right of 4.3 is less than 2.5%. Thus the p-value for the one-sided test pointing to the RIGHT is less than 2.5%.

One-sided test (pointing LEFT)
Suppose we are testing that polyphenol levels decrease.
This means testing the hypothesis H0: µ ≥ 0 against the alternative HA: µ < 0. Since 0 is not in the 95% confidence interval and the sample mean 4.3 lies above 0, the p-value is greater than 97.5% (there is no evidence to reject the null, which is clear since 4.3 lies within the range of means allowed by the null hypothesis).
Why? On the previous slide we showed that the p-value for the hypothesis pointing to the RIGHT is less than 2.5%: the area to the RIGHT of 4.3 is less than 2.5%. The p-value for the test pointing to the LEFT is the area to the LEFT of 4.3, which has to be greater than 97.5% (since the area to the left plus the area to the right is 100%).
But this is obvious. The point of a test is to see how plausible the data is under the null. If the sample mean is 4.3 and the null is that the true mean is greater than or equal to 0, this is highly plausible! If this is highly plausible we cannot reject the null.

[Figure: Illustration, mean in confidence interval]
[Figure: Illustration, mean not in confidence interval]

Example 1: CI and testing

Scientists want to understand whether Omega 3 supplements increase the IQ of people. They randomly sampled 30 people (who previously did not take any supplementation), took their IQ before the experiment and asked them to take a daily 1000mg dose of EPA/DHA Omega 3. After two months they measured the IQ again. They took the difference between the current IQ (after supplementation) and previous IQ (before supplementation) and evaluated the average, which was x̄ = 7 (so for this group there was an overall increase, but we do not know, without statistics, whether this is by chance). The 95% CI for the mean change was [-1, 15].
Question: We want to test the hypothesis H0: µ = 0 against HA: µ ≠ 0. What are the results of the test using the 5% significance level?
Answer: Because 0 is in the 95% CI [-1, 15], the p-value for the two-sided test is greater than 5%, so there is not enough evidence to reject the null.
Question: We want to test the hypothesis H0: µ ≤ 0 against HA: µ > 0 (in fact this is really the hypothesis of interest, as it asks whether Omega 3 on average increases IQ). What are the results of the test using the 5% significance level?
Answer: In this case, the p-value is the area to the RIGHT of 7. This is the smallest area. We know the p-value for the two-sided test is greater than 5%. This means the p-value for this one-sided test is greater than 2.5%, so we would NOT be able to reject the null if we did the test at the 1% level. However, it is unknown whether the p-value is less than 5%, so we do not know whether or not we can reject the null at the 5% level. Further calculations need to be done to determine the p-value in this case.
Question: We want to test the hypothesis H0: µ ≥ 0 against HA: µ < 0 (the hypothesis of interest asks whether Omega 3 on average decreases IQ). What are the results of the test using the 5% significance level?
Answer: Since the average x̄ = 7 lies within the range of means allowed by the null (µ ≥ 0), there is no evidence to reject the null.

Example 2: CI and testing

Scientists want to understand whether Omega 3 supplements increase the IQ of people. This time they randomly sampled 100 people (who previously did not take any supplementation), took their IQ before the experiment and asked them to take a daily 1000mg dose of EPA/DHA Omega 3. After two months they measured the IQ again. They took the difference between the current IQ (after supplementation) and previous IQ (before supplementation) and evaluated the average, which was x̄ = 6.5. The 95% CI for the mean change was [2.11, 10.88].
Question: We want to test the hypothesis H0: µ = 0 against HA: µ ≠ 0. What are the results of the test using the 5% significance level?
Answer: Because 0 is not in the 95% CI [2.11, 10.88], the area to the RIGHT (we need the smallest area) of 6.5 will be LESS than 2.5%.
Thus the p-value is less than 5% and we can reject the null at the 5% level.
Question: We want to test the hypothesis H0: µ ≤ 0 against HA: µ > 0 (in fact this is really the hypothesis of interest, as it asks whether Omega 3 on average increases IQ). What are the results of the test using the 5% significance level?
Answer: In this case, the p-value is the area to the RIGHT of 6.5, which we know from the two-sided test is LESS than 2.5%. This means we can reject the null at the 5% level.
Question: We want to test the hypothesis H0: µ ≥ 0 against HA: µ < 0 (the hypothesis of interest asks whether Omega 3 on average decreases IQ). What are the results of the test using the 5% significance level?
Answer: Since the average x̄ = 6.5 lies within the range of means allowed by the null (µ ≥ 0), there is no evidence to reject the null.
It is important to observe that the p-values for the two one-sided tests will always add up to one.

What is wrong with the following?

A random sample of size 30 is taken from a population that is assumed to have a standard deviation of 5. The standard deviation of the sample mean (standard error) is 5/30.
  Recall, the standard error is 5/√30.
A study where the sample mean is 45 reports statistical significance (p-value less than α%) for H0: µ ≤ 55 against HA: µ > 55.
  This is an example where you need to consider which way the one-sided test is pointing. It is clear that with a sample mean of 45, there is no evidence whatsoever to support the alternative. In this case the p-value will be greater than 50%.
  Why? Because the p-value is the area to the right of 45 with the mean centered at 55. If you make a plot, it is clear to see that the p-value is greater than 50%.
A test rejected the null hypothesis that the sample mean was equal to 50.
  Hypotheses are always about the population mean, which is unobserved. It makes no sense to state the hypotheses in terms of the sample mean, which is observed.
A test preparation company wants to test that the average score of their students on the ACT is better than the national score of 21.5. They state their alternative hypothesis as HA: µ > 21.5. The z-value is equal to 0.018. Because this is less than the significance level 5%, the null hypothesis is rejected.
  This is an example where the z-transform has been mistaken for a probability! We need to deduce the probability (which is the p-value) by looking up 0.018 in the z-tables. This turns out to be 49% (0.018 is the number of standard deviations the sample mean is from 21.5). Since it is so close to the mean, it is clear that the p-value will be just below 50% and we cannot reject the null.

How reliable are these p-values?

Remember the p-values we have calculated so far always use the normal or t-distribution (depending on whether the population standard deviation is known or not).
Underlying these calculations is the assumption that the sample mean is normally distributed (remember we always make a plot of the normal distribution and center it about the mean under the null). If the sample size is not large enough, so the central limit theorem has not `kicked in', then the sample mean won't be normally distributed. This means the probabilities we have calculated won't be reliable, just like the 95% CI for the mean won't really be a 95% confidence interval.
In this case we must be cautious in interpreting the results of the test. If the p-value is extremely small (say 0.0001), it would be small even if the correct distribution of the sample mean were used. On the other hand, if the p-value is close to the 5% significance level we need to be careful about its statistical significance.

Example: Siblings

The university is interested in the (population) mean number of younger siblings a student at the university has (in the hope that they will attend the university). They believe that the mean is greater than 0.25.
To test this hypothesis, H0: µ ≤ 0.25 against HA: µ > 0.25, they randomly sample 3 students and ask them how many siblings they have. They answer 0, 1, 3. The sample mean is 1.33 and the sample standard deviation is 1.53.
Question: What are the conclusions of the test at the 10% level? Comment on the reliability of the result.
Answer: The t-transform is t = (1.33 − 0.25)/(1.53/√3) = 1.22. Using the t-tables (with 2 df) we see that the area to the right of this lies somewhere between 15% and 20%. Since the alternative hypothesis is pointing RIGHT this means the p-value is between 15% and 20%, which is greater than 10%, so we cannot reject the null at the 10% level.
Now we comment on the reliability of this p-value. In HW9, Q1 we made a plot of the sample mean (based on size 3) for younger sibling numbers.
The distribution of the sample mean is the lowest plot on the left. This is clearly not normal (see also the corresponding QQ-plot).
This means that the p-value is not correct: it is based on normality when the sample mean is not normal. We therefore have to be very careful when we interpret this p-value.
We recall that if the sample size is larger (in Q2, Quiz 9 we looked at sample size n = 150), then the sample mean is close to normal and the corresponding p-value will be closer to the truth (as if it came from the true distribution of the sample mean).

Accompanying problems associated with this Chapter

Quiz 9
Quiz 9 part 2
Quiz 10
Quiz 11
Part of Homework 4
Homework 5 (Q1-Q6).
