AP Statistics Chapter 15 – Inference for Regression
Mr. Dooley
NAME_________________________________

Example #1: There is evidence that drinking moderate amounts of wine can reduce death from heart attacks. Before a randomized experiment is conducted, researchers want to see if there is even a correlation between wine consumption and heart attack deaths. The table below gives yearly wine consumption (liters per person) and yearly deaths from heart attacks (deaths per 10,000 people) in 12 developed nations:

Country        Wine Consumption    Heart attack deaths
Austria        3.9                 167
Canada         2.4                 191
Denmark        2.9                 220
France         9.1                 71
Ireland        0.7                 300
Italy          7.9                 107
Spain          6.5                 86
Sweden         1.6                 207
Switzerland    5.8                 115
UK             1.3                 285
US             1.0                 199
Germany        1.7                 172

a.) Which is the response and which is the explanatory variable?

b.) As always, the first thing to do is examine the data, and in this case the best way to examine the data is to look at a scatterplot. Does there appear to be a linear relationship between wine consumption and heart attack deaths?

[Scatterplot grid: Wine Consumption (liters / person) on the horizontal axis, 0 to 10; Heart Attack Deaths / 10,000 people on the vertical axis, 0 to 350]

c.) What is the least squares regression line (LSRL) for this data?

d.) What is the correlation coefficient, r, and the coefficient of determination, r^2? What does r^2 mean in the context of the problem? The correlation coefficient, r, is fairly strong in the negative direction. Can we therefore conclude that wine consumption reduces heart attack deaths?

e.) If it is found that Japan consumes an average of 4.2 liters of wine per person, what would be your prediction of the rate of heart attack deaths there?

f.) What is the residual for Austria? (residual = observed y – predicted y)

g.) Suppose that in Ukraine it was found that the average consumption of wine per person is 24 liters. What is your predicted number of heart attack deaths? How comfortable are you with your prediction?

h.) Create a residual plot for the data (residual = observed y – predicted y).

i.) Does the residual plot appear to have any pattern?

j.) Are there any outliers or influential points?

Doing Inference:
The slope, b, and the y-intercept, a, would no doubt be different if we were to pick other countries for our sample. Therefore, we can think of our regression equation as only a model based on a sample. This model estimates a "true regression line" that describes the straight-line relationship, on average, between wine consumption and death due to heart attacks. This "true regression line" takes the form:

$$ \mu_y = \alpha + \beta x $$

where the slope $\beta$ and the intercept $\alpha$ are unknown parameters.

Our goals in doing statistical inference are:
1) To create a confidence interval for the slope
2) To test the hypothesis that the slope is equal to zero (indicating no relationship between the two variables)

The standard deviation, $\sigma$, of y about the true regression line can be estimated by the standard error, s, from our data. We want to find out how scattered the data points are around the regression line. We estimate this scatter by how far the points in our sample fall from the sample regression line. The formula for doing this is:

$$ s = \sqrt{\frac{1}{n-2}\sum \text{residual}^2} = \sqrt{\frac{1}{n-2}\sum (y - \hat{y})^2} $$

where s is the standard error about the line.

What is the standard error about the line in our example?

$$ s = \sqrt{\frac{1}{n-2}\sum \text{residual}^2} = \sqrt{\frac{1}{n-2}\sum (y - \hat{y})^2} = 36.081 $$

Confidence Intervals for Regression Slope:
How good an estimate of the true slope is the slope of the regression line from our sample? We can create a confidence interval for the slope of a regression line using the same recipe as for any confidence interval. From the stat packet:

statistic ± (critical value) · (standard deviation of statistic)

For slope, this formula becomes:

$$ b \pm t^* \, SE_b $$

The standard error of the slope, $SE_b$, can be found by the following formula:

$$ SE_b = \frac{s}{\sqrt{\sum (x - \bar{x})^2}} $$

To find $t^*$, use n – 2 degrees of freedom.

Find $SE_b$ and the 95% confidence interval for the slope of the regression line in our example.
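As a check on the hand calculation, the formulas above for the LSRL, s, SE_b, and the confidence interval can be sketched in a few lines of Python. This is only one way to organize the arithmetic: the variable names are made up for illustration, and the critical value t* = 2.228 (95% confidence, 10 degrees of freedom) is assumed to be read from a t table rather than computed.

```python
import math

# Wine consumption (liters per person) and heart attack deaths
# (per 10,000 people) for the 12 countries in the table above.
wine = [3.9, 2.4, 2.9, 9.1, 0.7, 7.9, 6.5, 1.6, 5.8, 1.3, 1.0, 1.7]
deaths = [167, 191, 220, 71, 300, 107, 86, 207, 115, 285, 199, 172]

n = len(wine)
x_bar = sum(wine) / n
y_bar = sum(deaths) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - x_bar) ** 2 for x in wine)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(wine, deaths))

# Least squares slope b and intercept a.
b = sxy / sxx
a = y_bar - b * x_bar

# Standard error about the line: sqrt(sum of squared residuals / (n - 2)).
residuals = [y - (a + b * x) for x, y in zip(wine, deaths)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Standard error of the slope: s / sqrt(sum of (x - x_bar)^2).
se_b = s / math.sqrt(sxx)

# ASSUMPTION: t* = 2.228 is the 95% critical value for df = n - 2 = 10,
# taken from a t table.
T_STAR = 2.228
ci_low = b - T_STAR * se_b
ci_high = b + T_STAR * se_b

print(f"y-hat = {a:.3f} + ({b:.3f})x")
print(f"s = {s:.3f}, SE_b = {se_b:.3f}")
print(f"95% CI for slope: ({ci_low:.3f}, {ci_high:.3f})")
```

The printed s should agree with the 36.081 found above. If the whole interval lies below zero, that is consistent with a negative slope for the true regression line.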
Hypothesis Testing:
We will test the hypothesis that the slope of the true regression line is equal to zero (alpha level of 0.05). The test indicates whether or not there is a linear relationship between the two quantitative variables. If the slope is zero, we are effectively saying that the correlation between the two variables is zero.

Step 1: State your hypotheses

$$ H_0: \beta = 0 \qquad H_a: \beta < 0 $$

Step 2: Assumptions
See above

Step 3: Calculate the test statistic

$$ t = \frac{b - 0}{SE_b} = \frac{-22.276}{3.764} = -5.918 $$

degrees of freedom = n – 2 = 12 – 2 = 10

Calculate the p-value:
p-value = P(t < -5.918) = 0.000074

Step 4: Conclusion
Since the p-value is less than the alpha level of 0.05, we reject the null hypothesis in favor of the alternative and conclude that the slope of the true regression line is less than zero. There is a relationship between wine consumption and heart attack deaths: as wine consumption increases, deaths due to heart attacks decrease.

Using Computer Output:
More often than not, regression analysis is not done by hand (calculating the necessary statistics is tedious) but with computer software. Though the output from computer software looks slightly different, all the information we need to reach a conclusion is there. We just need to sift through the numbers and extract what we need.

Example 1: The following is the output for data that compared the performance of golfers in the first and second rounds of a tournament. The data are in the table below:

Golfer     1    2    3    4    5    6    7    8    9   10   11   12
Round 1   89   90   87   95   86   81  102  105   83   88   91   79
Round 2   94   85   89   89   81   76  107   89   87   91   88   80

1) What is the regression equation for this data?
2) What is the standard error about the line?
3) What is SEb?
4) What is the t statistic for the slope of the regression line?
5) What two numbers were used to calculate this t statistic?
6) What is the p-value associated with this t-statistic?
7) Construct a 95% confidence interval for the slope of the true regression line.
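Computer output reports a t statistic and a p-value for the slope. To see where those two numbers come from, the sketch below recomputes them for the wine example using the slope (-22.276) and SE_b (3.764) from the hypothesis test earlier. The function names are made up for illustration, and the tail area is approximated with Simpson's rule on the t density instead of a calculator's tcdf command. Many packages print a two-sided p-value, so that is computed as well; which one a given package prints is an assumption to check against its documentation.

```python
import math

def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_lower_tail(t, df, steps=10_000):
    """Approximate P(T < t) using Simpson's rule on [0, |t|] (steps must be even)."""
    width = abs(t) / steps
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * width, df)
    area_0_to_abs_t = total * width / 3
    # The t density is symmetric about 0, so half the area lies on each side.
    return 0.5 - area_0_to_abs_t if t < 0 else 0.5 + area_0_to_abs_t

# Values from the wine example's hypothesis test.
b = -22.276     # sample slope
se_b = 3.764    # standard error of the slope
n = 12          # number of countries
df = n - 2

t_stat = (b - 0) / se_b                      # test statistic under H0: beta = 0
p_one_sided = t_lower_tail(t_stat, df)       # P(t < t_stat), matches Ha: beta < 0
p_two_sided = 2 * min(p_one_sided, 1 - p_one_sided)

print(f"t = {t_stat:.3f} with df = {df}")
print(f"one-sided p-value = {p_one_sided:.6f}")
print(f"two-sided p-value = {p_two_sided:.6f}")
```

The one-sided value should land close to the 0.000074 found by hand. When the output shows only a two-sided p-value and the alternative is one-sided in the direction of the observed slope, halving it gives the one-sided value.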
Example 2: The table below gives data on the relationship between the number of "degree-days" per month and the amount of gas consumption for that month. The output below the table summarizes the regression information for the data.

1) What is the regression equation for this data?
2) What is the standard error about the line?
3) What is SEb?
4) What is the t statistic for the slope of the regression line?
5) What two numbers were used to calculate this t statistic?
6) What is the p-value associated with this t-statistic?
7) Construct a 95% confidence interval for the slope of the true regression line.