Hypothesis testing
Summer Program
Brian Healy

Last class
– Study design
  – What is sampling variability?
  – How does our sample affect the questions we can answer?
– Basics of probability
– Central limit theorem
– Sample mean

What are we doing today?
– Rare events
– p-value
– Hypothesis tests
– t-distribution / sample standard deviation

Big picture
We discussed last week that we can estimate the population mean with the sample mean, and the central limit theorem told us the distribution of the sample mean. Now we are going to consider testing whether or not the population mean is equal to a hypothesized value, using our sample mean. We call this hypothesized value the null hypothesis. This test allows us to compare our sample to a value in a statistically meaningful way.

Null hypothesis
– We set up the null hypothesis so that we can reject it: the test is designed to disprove the null.
– Stating the null is the first and most important step in any problem, and it requires knowledge of the problem.
– Notation: H0
– H0: My mother can run a 5 minute mile.
  – Not: My mother cannot run a 5 minute mile.
– H0: The probability of heads on the coin is 0.5.
  – Not: The probability is not 0.5.

Alternative hypothesis
– Notation: HA or H1
– Has two characteristics:
  – Must cover all values not included in the null
  – Must contain the value that we think is going to happen
– HA: My mother runs a mile slower than 5 minutes.
– HA: The probability of heads is not 0.5.

Hypothesis test
– Definition: a statistical test of a null hypothesis
– Completed under the assumption that the null is true (conditional probability)
– We always want to disprove the null hypothesis
– One-sided example:
  H0: Mom's mean time <= 5:00
  HA: Mom's mean time > 5:00
– Two-sided example:
  H0: Probability of heads = 0.5
  HA: Probability of heads != 0.5
– The most important step is properly defining the null and alternative hypotheses.

How do we test this hypothesis?
Take a sample
– As we have discussed, we want to think carefully about how to collect the sample so that we limit bias and confounding and allow the results to be generalized to the proper population.
– From this sample, we can find a summary statistic and compare it to the null hypothesis:
  – Mean (t-test, linear regression)
  – Median (Wilcoxon tests, quantile regression)

What does this have to do with the CLT?
– To test a hypothesis, we take a sample and find the sample mean.
  – Ex. Have my mom run a mile 10 times, or flip the coin 20 times.
  – Determining the proper sample size is next class.
– Under the null hypothesis, we know the population mean.
– We sometimes may know the population variance.
– Under these conditions, the distribution of the sample mean is normal with known mean and variance.

Distribution of test statistic
– Under the null hypothesis, the distribution of $\bar{X}$ is normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$:
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
– Now we want to find the probability of observing our sample mean or a value more extreme, under the null (the p-value), to judge whether the data are consistent with the null hypothesis.
– Have we observed a rare event? Is it rare enough to reconsider the null?

What is a rare event?
– My mom claims that she runs a mile in 5 minutes. I think she can't.
– How can I test this?
– What happens if she ran a mile in
  – 5:15 minutes?
  – 6 minutes?
  – 10 minutes?
– What if she ran 5 separate miles at 10 minutes on average?

What is a rare event?
– You play a game against a friend. In this game, you win a dollar if the coin lands heads and you lose a dollar if it lands tails.
– What is the null hypothesis?
– What if the coin landed on tails 2 consecutive times?
– What if the coin landed on tails 10 consecutive times?
– At what point would you start to get suspicious?
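To put numbers on the coin questions above, here is a minimal R sketch, assuming the null hypothesis of a fair coin with independent tosses, so that a run of k tails has probability 0.5^k. The variable names (k, p_streak, first_rare) and the 5% threshold used to flag a streak as "rare" are our own illustrative choices, not part of the lecture.

```r
# Under H0 the coin is fair, so each toss is tails with probability 0.5 and
# tosses are independent: P(k tails in a row) = 0.5^k.
k <- 1:10
p_streak <- 0.5^k

# Illustrative threshold (our choice): first streak length rarer than 5%
first_rare <- k[p_streak < 0.05][1]

data.frame(tails_in_a_row = k, probability = signif(p_streak, 3))
first_rare   # 5, since 0.5^4 = 0.0625 but 0.5^5 = 0.03125
```

Two tails in a row happens a quarter of the time under a fair coin, while ten in a row has probability of about 0.001, which is why the second scenario feels suspicious.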
– We want to know whether the event we observed could have happened simply by chance, or whether something else is more likely going on.

P-value
– Tells you how rare the event is
– Definition: given a null hypothesis, the probability of the observed value or something more extreme:
  P(event or something more extreme | H0 is true)
– Ex. Coin toss problem
  – Null hypothesis: P(tails) = 0.5
  – Sample: 9 out of 10 tosses are tails
  – P(9 or more tails | H0 is true) = P(9 tails | H0 is true) + P(10 tails | H0 is true) = 10/1024 + 1/1024 = 0.011

Alpha level (type I error)
– Definition: the probability of rejecting the null hypothesis when the null hypothesis is in fact true (rejection probability).
– Usually 0.05 or 0.1, but set by the investigator.
– Compare the p-value to the alpha level to determine whether you have a significant result.
– This value defines how rare an event needs to be for us to say that the event did not occur by chance.
– It is called an error because, when the null hypothesis is true, the conclusion that the result was not due to chance is wrong, and we will draw that conclusion α×100% of the time.

One-sided or two-sided

Steps for hypothesis testing
1) State null and alternative hypotheses
2) State type of test and alpha level
3) Determine and calculate appropriate test statistic
4) Calculate p-value
5) Decide whether to reject or not reject the null hypothesis
   • NEVER accept the null
6) Write conclusion

Example
A study in New Bedford looked at pregnant teens to see how long after conception each young woman arrived at the physician's office for her first visit, and the amount of time between the first visit and the second visit.
Questions:
– Do teens from a low-income area arrive at a clinic later than the average woman?
– Is there more time between the first and second visit among these teens?
It is known that the average amount of time from conception until a woman first visits her doctor is 8.5 weeks (this number is an estimate because it is difficult to know exactly when conception occurred), and the average amount of time from the first visit to the second visit is 4.8 weeks.
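Returning briefly to the coin-toss p-value above, the same tail probability can be computed three ways in R. This is a minimal sketch: dbinom, pbinom and binom.test are standard R functions, while the variable names are our own. It assumes we count tails, so "more extreme" means more tails (a one-sided test).

```r
# Null hypothesis: P(tails) = 0.5; observed 9 tails in 10 tosses.
n_toss  <- 10
n_tails <- 9

# Directly from the definition: P(9 tails) + P(10 tails) under H0
p_manual <- sum(dbinom(n_tails:n_toss, size = n_toss, prob = 0.5))

# Upper tail of the binomial CDF: P(X >= 9) = P(X > 8)
p_tail <- pbinom(n_tails - 1, size = n_toss, prob = 0.5, lower.tail = FALSE)

# Exact binomial test, one-sided in the direction of "more tails"
p_test <- binom.test(n_tails, n_toss, p = 0.5, alternative = "greater")$p.value

c(p_manual, p_tail, p_test)   # all equal 11/1024, about 0.0107
```

All three agree with the hand calculation of 0.011, which is below an alpha level of 0.05.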
For the moment, let's assume that we know the population standard deviations for these two times are 2.6 weeks and 2.2 weeks, respectively. We have collected a sample of 35 pregnant teens, and we would like to know whether they take longer to get to their first visit than the average woman.

Sample data
As with all of the data sets from now on, the data are on the BIO232 website. Let's determine the mean for this sample and compare it to the hypothesized value.

    preg <- read.table("preg.dat", header = TRUE)
    first <- preg[, 1]   # weeks from conception to first visit
    mean(first)          # this is the sample mean
    [1] 9.74

So the sample mean is clearly not equal to the population mean (8.5 weeks), but is it sufficiently different to say that these teens differ from the population?

Steps for hypothesis testing
1) Null: $\mu = 8.5$ weeks; Alternative: $\mu \neq 8.5$ weeks
2) One-sample z-test, alpha = 0.05
3) $$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{9.74 - 8.5}{2.6/\sqrt{35}} = 2.82$$
4) Area in upper tail = 0.0024, so the two-sided p-value = 0.0048
5) Reject the null
6) Conclusion: There is a difference in the amount of time from conception to the first visit to a physician. The time is longer for the pregnant teens.

Picture
[Figure: null distribution of the sample mean centered at 8.5, with the observed value 9.74 marked and an area of 0.0024 shaded in each tail]

Normal hypothesis test in R
To complete a normal hypothesis test in R, you can use the pnorm command with the appropriate mean and standard deviation; for the sample mean, the standard deviation is the standard error $\sigma/\sqrt{n}$. Remember that, by default, pnorm returns the area in the lower tail. For the previous problem, to get the two-sided p-value, use

    (1 - pnorm(9.74, 8.5, 2.6/sqrt(35))) * 2

Another way to look at the test
Given a specific alpha level, you can find the cut-off beyond which the null hypothesis would be rejected for every more extreme value. The region of values more extreme than the cut-off is called the rejection region. For our present problem (two-sided, alpha = 0.05, so an area of 0.025 above the upper cut-off), the upper cut-off solves
$$z = \frac{\text{cut} - 8.5}{2.6/\sqrt{35}} = 1.96 \quad\Rightarrow\quad \text{cut} = 8.5 + 1.96\,\frac{2.6}{\sqrt{35}} = 9.36$$
Since the observed mean 9.74 is above 9.36, it falls in the rejection region.

Practice
Here are the times my mom ran in the 10 trials. Test the null hypothesis that she can run a 9:00 mile on average.
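Before turning to the practice data, the teen z-test and rejection-region cut-off worked above can be reproduced in a few lines of R. This is a sketch that treats the population standard deviation (2.6 weeks) as known, as the example assumes; the variable names are illustrative, not from the course code. Note that pnorm needs the standard error of the mean, not sigma itself.

```r
# z-test for the teen example, treating the population SD as known.
# Variable names (mu0, sigma, se, ...) are illustrative, not from the course code.
mu0   <- 8.5     # hypothesized mean under H0 (weeks)
sigma <- 2.6     # population SD, assumed known (weeks)
n     <- 35      # number of teens sampled
xbar  <- 9.74    # observed sample mean (weeks)

se <- sigma / sqrt(n)                 # standard deviation of the sample mean
z  <- (xbar - mu0) / se               # test statistic, about 2.82

# Two-sided p-value from the standard normal: twice the upper-tail area
p_two_sided <- 2 * pnorm(-abs(z))     # about 0.0048

# Same p-value on the scale of the sample mean; note sd = se, not sigma
p_alt <- 2 * (1 - pnorm(xbar, mean = mu0, sd = se))

# Two-sided rejection region at alpha = 0.05
cut_upper <- mu0 + qnorm(0.975) * se  # about 9.36
cut_lower <- mu0 - qnorm(0.975) * se
reject <- xbar > cut_upper || xbar < cut_lower

c(z = z, p = p_two_sided, cut = cut_upper)
```

Working with z and working directly with the sample mean give the same p-value; the rejection-region view simply moves the comparison onto the scale of weeks.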
    mom <- c(9.5, 10, 8.75, 9, 11.2, 8.65, 9.6, 10.2, 8.8, 9.8)

– What are the null and alternative hypotheses?
– What do you conclude?
– What would have happened if we had completed a two-sided test?

Comparison of one-sided and two-sided tests
– For a symmetric test statistic such as z or t, the two-sided p-value is twice the one-sided p-value (computed in the direction of the observed effect).
– The two-sided test is more conservative because the rejection region is split between the high and low sides. For the one-sided test, the rejection region is only on the side of interest.
– The two-sided test is most common in the literature, even though people usually know the direction of the effect they are interested in detecting.

Wait a minute
– Up to now, we have assumed that we know the population variance (is this a good assumption?)
– How could we estimate the population variance?
  – Sample variance!!!
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
  – Is the sample variance exactly equal to the population variance?
  – How can we account for the additional uncertainty?
– Now, we need to do a little math.

t-distribution
– Assume the $X_i$ are iid $N(\mu, \sigma^2)$.
– Normal distribution:
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
– Chi-square distribution (proof of this is given in Casella and Berger and in Inference I):
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$$
– t-distribution: ratio of a Normal (U) and the square root of an independent chi-square (V) divided by its degrees of freedom ($\bar{X}$ and $S^2$ are independent for normal samples):
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{(\bar{X} - \mu)/(\sigma/\sqrt{n})}{\sqrt{\dfrac{(n-1)S^2/\sigma^2}{n-1}}} = \frac{U}{\sqrt{V/(n-1)}} \sim t_{n-1}$$

t-distribution
– Heavier tails than the normal distribution
  – Accounts for additional variability
  – Tails are heavier with fewer degrees of freedom (dof)
– As dof goes to infinity, the t distribution approaches the normal distribution
– Can use a t-distribution test statistic just as the previous normal test statistic
– Remember the assumption of an underlying normal distribution

Example
– We can use a t-test to test the second null hypothesis about our pregnant teens, namely that the time from the first visit to the second visit is the same as in the general population.
– First, we need to ensure that the underlying distribution is approximately normal.
[Figure: Histogram of second (x-axis: second; y-axis: Frequency)]

Steps for hypothesis testing
1) Null: $\mu = 4.8$ weeks; Alternative: $\mu \neq 4.8$ weeks
2) One-sample t-test, alpha = 0.05
3) $$t_{34} = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{5.97 - 4.8}{2.04/\sqrt{35}} = 3.4$$
4) p-value = 0.0017
5) Reject the null
6) Conclusion: There is a difference in the amount of time from the first visit to the second visit. The time is longer for the pregnant teens.

One-sample t-test in R
To complete a t-test in R, use

    > t.test(second, mu = 4.8)

            One Sample t-test

    data:  second
    t = 3.4035, df = 34, p-value = 0.00172
    alternative hypothesis: true mean is not equal to 4.8
    95 percent confidence interval:
     5.271960 6.670897
    sample estimates:
    mean of x
     5.971429

Practice
Using the class data set, test the following hypotheses:
– The average age of an incoming student to the biostat program is 25. Is the mean age of this year's class significantly different? Is there anything we need to consider in this analysis?
– The average height of an incoming student is 71 inches. Is the mean height of this year's class significantly shorter?

More practice
The TV watching habits of my seventh grade classes are in the dataset TV.dat on the course website. The gender and age of the students are given as well. How did my students' TV watching habits compare to the national average for 7th graders of 4 hours/day? Use an alpha level of 0.01.
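To connect the t-distribution derivation with the t.test() output for the teen example, here is a minimal R sketch of a one-sample t-test built directly from the formulas. The helper one_sample_t is hypothetical (written here for illustration, not a course or base-R function). Because the preg.dat data are not reproduced here, the helper is checked against t.test() on simulated normal data, which are NOT the course data. The teen statistic is recomputed from the rounded summary values, so it differs slightly from the printed output.

```r
# Hypothetical helper: one-sample t-test from the formulas (not a replacement for t.test)
one_sample_t <- function(x, mu0) {
  n    <- length(x)
  xbar <- mean(x)
  s    <- sd(x)                          # sd() uses the n - 1 denominator
  t    <- (xbar - mu0) / (s / sqrt(n))   # t statistic with n - 1 dof
  df   <- n - 1
  p    <- 2 * pt(-abs(t), df)            # two-sided p-value
  half <- qt(0.975, df) * s / sqrt(n)    # half-width of the 95% CI
  list(t = t, df = df, p.value = p, conf.int = c(xbar - half, xbar + half))
}

# Teen example from rounded summary statistics (xbar, s, n from the slide)
t_teen <- (5.971429 - 4.8) / (2.04 / sqrt(35))   # about 3.40
p_teen <- 2 * pt(-abs(t_teen), df = 34)          # about 0.0017

# Simulated data (NOT the course data) to confirm the helper matches t.test()
set.seed(1)
x_sim <- rnorm(20, mean = 5, sd = 2)
mine  <- one_sample_t(x_sim, mu0 = 4.8)
ref   <- t.test(x_sim, mu = 4.8)
all.equal(mine$t, unname(ref$statistic))         # TRUE
all.equal(mine$p.value, ref$p.value)             # TRUE

# Heavier t tails: 97.5% critical values shrink toward qnorm(0.975) = 1.96
qt(0.975, df = c(4, 9, 34, 1000))                # roughly 2.78 2.26 2.03 1.96
```

The last line shows the "heavier tails" point numerically: with 34 degrees of freedom the two-sided 5% critical value is about 2.03 rather than 1.96, and it approaches the normal value as the degrees of freedom grow.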