Inference for Regression

AP Statistics
Chapter 15 – Inference for Regression
Mr. Dooley
NAME_________________________________
Example #1: There is evidence that drinking moderate amounts of wine can reduce deaths from heart
attacks. Before a randomized experiment is conducted, researchers want to see if there is even a
correlation between wine consumption and heart attack deaths. The table below gives yearly wine
consumption (liters per person) and yearly deaths from heart attacks (deaths per 10,000 people) in 12
developed nations:
Country        Wine consumption       Heart attack deaths
               (liters per person)    (per 10,000 people)
Austria        3.9                    167
Canada         2.4                    191
Denmark        2.9                    220
France         9.1                     71
Ireland        0.7                    300
Italy          7.9                    107
Spain          6.5                     86
Sweden         1.6                    207
Switzerland    5.8                    115
UK             1.3                    285
US             1.0                    199
Germany        1.7                    172
a.) Which is the response and which is the explanatory variable?
b.) As always, the first thing to do is examine the data and in this case the best way to examine the
data is to look at a scatterplot. Does there appear to be a linear relationship between wine consumption
and heart attack deaths?
[Scatterplot: Heart Attack Deaths (per 10,000 people), 0 to 350, versus Wine Consumption (liters per person), 0 to 10]
c.) What is the least squares regression line (LSRL) for this data?
d.) What are the correlation coefficient, r, and the coefficient of determination, $r^2$? What does $r^2$ mean in
the context of the problem? The correlation coefficient, r, is fairly strong in the negative direction. Can
we therefore conclude that wine consumption reduces heart attack deaths?
e.) If it is found that Japan consumes an average of 4.2 liters of wine per person, what would be your
prediction of the rate of heart attack deaths there?
f.) What is the residual for Austria? (observed y – predicted y)
g.) Suppose that in Ukraine the average consumption of wine was found to be 24 liters per person.
What is your predicted rate of heart attack deaths? How comfortable are you with your
prediction?
h.) Create a residual plot for the data (residual = observed y – predicted y)
i.) Does the residual plot appear to have any pattern?
j.) Are there any outliers or influential points?
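The calculations behind parts c.) through h.) can be checked with a short script. The following is a minimal sketch in plain Python, assuming the data are typed in the same order as the table above; the function name fit_lsrl and the variable names are illustrative choices, not part of the worksheet.

```python
# Wine consumption (liters per person) and heart attack deaths (per 10,000 people),
# in table order: Austria, Canada, Denmark, France, Ireland, Italy,
# Spain, Sweden, Switzerland, UK, US, Germany.
wine = [3.9, 2.4, 2.9, 9.1, 0.7, 7.9, 6.5, 1.6, 5.8, 1.3, 1.0, 1.7]
deaths = [167, 191, 220, 71, 300, 107, 86, 207, 115, 285, 199, 172]


def fit_lsrl(x, y):
    """Return (intercept a, slope b, correlation r) for the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                  # slope
    a = y_bar - b * x_bar          # intercept
    r = sxy / (sxx * syy) ** 0.5   # correlation coefficient
    return a, b, r


a, b, r = fit_lsrl(wine, deaths)

# residual = observed y - predicted y, one per country
residuals = [yi - (a + b * xi) for xi, yi in zip(wine, deaths)]

print(f"LSRL: y-hat = {a:.3f} + ({b:.3f}) x")
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
print(f"Residual for Austria: {residuals[0]:.2f}")
```

Plotting residuals against wine gives the residual plot asked for in part h.). Residuals from a least squares line always sum to zero, which is a quick check on the arithmetic.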
Doing Inference:
The slope, b, and the y-intercept, a, would no doubt be different if we had picked other countries for
our sample. Therefore, we can think of our regression equation as only a model based on a sample.
It estimates a “true regression line” that describes the “on the average” straight-line
relationship between wine consumption and death due to heart attacks. This “true regression line”
takes the form $\mu_y = \alpha + \beta x$, where the slope $\beta$ and intercept $\alpha$ are unknown parameters.
Our goals in doing statistical inference are:
1) To create a confidence interval for the slope $\beta$
2) Test the hypothesis that the slope $\beta$ is equal to zero (indicating no relationship between the
two variables)
The standard deviation, $\sigma$, of y about the true regression line can be estimated by the
standard error, s, computed from our data. We want to find out how scattered our data points are
around the regression line, and we estimate this scatter from the residuals in our sample. The
formula for doing this is:
$$s = \sqrt{\frac{1}{n-2}\sum \text{residual}^2} = \sqrt{\frac{1}{n-2}\sum (y - \hat{y})^2}$$
where s is the standard error about the line.
What is the standard error about the line in our example?
$$s = \sqrt{\frac{1}{n-2}\sum \text{residual}^2} = \sqrt{\frac{1}{n-2}\sum (y - \hat{y})^2} = 36.081$$
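The value 36.081 can be reproduced directly from the table. The following is a minimal sketch, assuming the twelve data pairs from Example #1; the function name standard_error_about_line is a hypothetical helper chosen for illustration.

```python
import math

# Data from Example #1, in table order.
wine = [3.9, 2.4, 2.9, 9.1, 0.7, 7.9, 6.5, 1.6, 5.8, 1.3, 1.0, 1.7]
deaths = [167, 191, 220, 71, 300, 107, 86, 207, 115, 285, 199, 172]


def standard_error_about_line(x, y):
    """Fit the least squares line, then return s = sqrt(sum(residual^2) / (n - 2))."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    # Sum of squared residuals (observed y - predicted y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Divide by n - 2 because two parameters (slope and intercept) were estimated
    return math.sqrt(sse / (n - 2))


s = standard_error_about_line(wine, deaths)
print(f"s = {s:.3f}")
```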
Confidence Intervals for Regression Slope:
How good an estimate of the true slope is the slope of the regression line from our sample? We can
create a confidence interval for the slope of the true regression line using the same recipe as for any
confidence interval:
From the stat packet:
statistic ± (critical value) · (standard deviation of statistic)
For slope, this formula becomes:
$$b \pm t^* \cdot SE_b$$
The standard error of the slope, SEb, can be found by the following formula:
$$SE_b = \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$$
To find $t^*$, use n – 2 degrees of freedom.
Find SEb and the 95% confidence interval for the slope of the regression line in our example.
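One way to check your work on this question is the sketch below. It assumes the Example #1 data and takes the critical value t* = 2.228 from a t table (95% confidence, 10 degrees of freedom); the constant T_STAR and the other names are illustrative assumptions, not part of the stat packet.

```python
import math

# Data from Example #1, in table order.
wine = [3.9, 2.4, 2.9, 9.1, 0.7, 7.9, 6.5, 1.6, 5.8, 1.3, 1.0, 1.7]
deaths = [167, 191, 220, 71, 300, 107, 86, 207, 115, 285, 199, 172]

# Assumption: critical value read from a t table for 95% confidence, df = n - 2 = 10.
T_STAR = 2.228

n = len(wine)
x_bar = sum(wine) / n
y_bar = sum(deaths) / n
sxx = sum((x - x_bar) ** 2 for x in wine)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(wine, deaths))

b = sxy / sxx                      # sample slope
a = y_bar - b * x_bar              # sample intercept
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(wine, deaths))
s = math.sqrt(sse / (n - 2))       # standard error about the line

se_b = s / math.sqrt(sxx)          # standard error of the slope
margin = T_STAR * se_b             # (critical value) * (standard error)
ci_low, ci_high = b - margin, b + margin

print(f"SE_b = {se_b:.3f}")
print(f"95% CI for the slope: ({ci_low:.2f}, {ci_high:.2f})")
```

If the whole interval lies below zero, that agrees with a test that rejects a slope of zero at the matching alpha level.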
Hypothesis Testing: We will test the hypothesis that the slope of the true regression line is equal to
zero (alpha level of 0.05). This tells us whether or not there is evidence of a linear relationship between
the two quantitative variables. If the slope is zero, we are effectively saying that the correlation between
the two variables is zero.
Step 1: State your hypothesis
$H_0: \beta = 0$
$H_a: \beta < 0$
Step 2: Assumptions
See above
Step 3: Calculate test statistic
$$t = \frac{b - 0}{SE_b} = \frac{-22.276}{3.764} = -5.918$$
degrees of freedom = n – 2 = 12 – 2 = 10
Calculate the p-value
p-value = P(t<-5.918) = 0.000074
Step 4: Conclusion
Since the p-value is less than the alpha level of 0.05, we reject the null hypothesis in favor of the
alternative and conclude that the slope of the true regression line is less than zero. There is evidence
of a negative linear relationship between wine consumption and heart attack deaths: as wine
consumption increases, the rate of death due to heart attacks tends to decrease.
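The test statistic and p-value in Step 3 can be checked without a t table. The sketch below uses only the numbers quoted above (b = -22.276, SEb = 3.764, 10 degrees of freedom) and approximates the tail area of the t distribution with Simpson's rule; the helper names t_pdf and t_lower_tail are hypothetical, and a statistics package would normally do this step for you.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def t_lower_tail(t, df, steps=20000):
    """Approximate P(T < t) using Simpson's rule and the symmetry of the t curve."""
    if t > 0:
        return 1.0 - t_lower_tail(-t, df, steps)
    # Integrate the density from t up to 0; steps must be even for Simpson's rule.
    h = (0.0 - t) / steps
    total = t_pdf(t, df) + t_pdf(0.0, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    area_t_to_zero = total * h / 3
    # Half of the total area lies below 0, so subtract the piece between t and 0.
    return 0.5 - area_t_to_zero


b = -22.276      # sample slope from the worksheet
se_b = 3.764     # standard error of the slope from the worksheet
df = 12 - 2      # n - 2 degrees of freedom

t_stat = (b - 0) / se_b
p_value = t_lower_tail(t_stat, df)   # one-sided: P(T < t_stat)

print(f"t = {t_stat:.3f}")
print(f"p-value = {p_value:.6f}")
```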
Using Computer Output: More often than not, regression analysis is not done by hand (calculating
the necessary statistics is tedious) but with computer software. Though the output from computer
software looks slightly different, the information that we need to come to a conclusion is all there;
we just need to sift through the numbers and extract what we need.
Example 1: The following is the output for data that compared the performance of 12 golfers in the
first and second rounds of a tournament. The data are in the table below:
Golfer     1    2    3    4    5    6    7    8    9   10   11   12
Round 1   89   90   87   95   86   81  102  105   83   88   91   79
Round 2   94   85   89   89   81   76  107   89   87   91   88   80
1) What is the regression equation for this data?
2) What is the standard error about the line?
3) What is SEb?
4) What is the t statistic for the slope of the regression line?
5) What two numbers were used to calculate this t statistic?
6) What is the p-value associated with this t-statistic?
7) Construct a 95% confidence interval for the slope of the true regression line.
Example 2: The table below gives data that compares the relationship between the number of
“degree-days” per month, and the amount of gas consumption for that month. The output below the
table summarizes the regression information for the data.
1) What is the regression equation for this data?
2) What is the standard error about the line?
3) What is SEb?
4) What is the t statistic for the slope of the regression line?
5) What two numbers were used to calculate this t statistic?
6) What is the p-value associated with this t-statistic?
7) Construct a 95% confidence interval for the slope of the true regression line.