Stat 401G Lab 9: Solution Fall 2012

Below are the Olympic Gold Medal 200 m dash times for women and men from 1948 through
2004.
Year    Women’s Time    Men’s Time
1948    24.40           21.10
1952    23.70           20.70
1956    23.40           20.60
1960    24.00           20.50
1964    23.00           20.30
1968    22.50           19.80
1972    22.40           20.00
1976    22.37           20.23
1980    22.03           20.19
1984    21.81           19.80
1988    21.34           19.75
1992    21.81           20.01
1996    22.12           19.32
2000    21.84           20.09
2004    22.05           19.79
2008
We want to be able to predict the gold medal time for the 200 m dash at the 2008 Olympics in
Beijing. We also wish to investigate how the winning times have changed over the past 60 years
for both women and men.
1. Consider the combined set of women’s and men’s times. Fit a simple linear regression with
time as the response and year as the explanatory variable.
a) Give the least squares prediction equation.
Predicted Time = 85.01 – 0.03221*Year
b) Give an interpretation of the estimated slope coefficient.
For each 4 year increase in Year (the Olympics occur only every four years), the
200 m dash time is predicted to decrease, on average, by 0.03221*4 = 0.129 seconds.
c) Why can’t we interpret the estimated intercept within the context of the problem?
Although the year 0 makes sense, the modern Olympics were not held back then.
Because the first year of data is 1948, interpreting the intercept at year 0 would
mean extrapolating far beyond the data.
d) How much of the variability in time is explained by year?
The value of R² is 0.157 so 15.7% of the variation in time is explained by the linear
relationship with year.
e) What do you notice about the plot of residuals versus time? What does this indicate?
There are two distinct groups of points. One group is entirely above the zero line,
the other group is entirely below the zero line. These two groups correspond to the
women’s and men’s times. Using a variable to differentiate between men and
women would probably improve the fit of the model.
2. Fit a multiple linear regression with time as the response and three explanatory
variables: (year – 1976), an indicator variable for gender (Gender = 0 if female,
Gender = 1 if male) and a (year – 1976) by gender interaction term.
a) How much of the variability in time is explained by this model?
The value of R² is 0.935 so 93.5% of the variation in time is explained by the
interaction model with (Year – 1976) and Gender.
b) Give the least squares prediction equation for this model.
Predicted Time = 22.585 – 0.0442*(Year – 1976) – 2.439*Gender
+ 0.0240*(Year – 1976)*Gender
c) Interpret each of the estimated parameters within the context of the problem.
In 1976, when (Year – 1976) = 0, the predicted 200 m dash time for women
(Gender = 0) is 22.585 seconds, on average.
Holding Gender constant at 0 (women), a 4 year increase in (Year – 1976) will see a
decrease of 0.0442*4 = 0.177 seconds, on average. That is, the average decrease in
women’s 200 meter dash time from one Olympics to the next is 0.177 seconds.
In 1976, when (Year – 1976) = 0, men are predicted to be 2.439 seconds faster than
women, on average.
The estimated coefficient for the interaction term (Year – 1976)*Gender is the
difference in the rates of change for men compared to women. On average, men’s
200 m dash times are decreasing at a slower rate than women’s, 0.0240 seconds per
year slower.
d) In 1976, were the predicted times for men and women statistically different? Support
your answer with the appropriate test or confidence interval.
Yes. This is just a test of significance for the estimated coefficient for Gender.
The value of the test statistic is either t = –17.41 or F = 303.21 with associated
P-value < 0.0001. The small P-value indicates a statistically significant difference
between the genders in 1976.
e) Are women’s and men’s times changing at statistically different rates? Support your
answer with the appropriate test or confidence interval.
Yes. This is just a test of significance for the estimated coefficient for the
interaction term. The value of the test statistic is either t = 2.96 or F = 8.75 with
associated P-value of 0.0065. The small P-value indicates that the women’s and men’s
times are changing at different rates from year to year.
f) Predict the winning times for women and men at the 2008 Beijing Olympics. How do
these predictions compare to the actual times of 19.30 for men and 21.74 for women?
Women: Predicted time = 21.17 seconds
Men: Predicted time = 19.50 seconds
The women’s predicted time is too low and the men’s predicted time is a little high.
g) Describe the plot of residuals versus year. What does this indicate about the fit of the
model?
There is an indication of a curve in the plot of residuals versus year. Up through
1960 most of the points are above the zero line. From 1964 through 1988 most of the
points are below the zero line. From 1992 through 2004 most of the points are
above the zero line. Adding a quadratic term for year may improve the fit of the
model.
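The coefficients in Question 2 can be checked without regression software. Because the interaction model gives each gender its own intercept and its own slope, its least squares fit is the same as fitting a separate line to each gender. The sketch below relies on that equivalence; it is an illustration in plain Python, and the names fit_line and predict are assumptions made for this example.

```python
# Gold medal 200 m times in seconds, 1948 through 2004, from the table above.
years = list(range(1948, 2005, 4))
women = [24.40, 23.70, 23.40, 24.00, 23.00, 22.50, 22.40, 22.37,
         22.03, 21.81, 21.34, 21.81, 22.12, 21.84, 22.05]
men = [21.10, 20.70, 20.60, 20.50, 20.30, 19.80, 20.00, 20.23,
       20.19, 19.80, 19.75, 20.01, 19.32, 20.09, 19.79]


def fit_line(x, y):
    """Return the least squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


centered = [yr - 1976 for yr in years]          # (Year - 1976)
w_int, w_slope = fit_line(centered, women)      # line for Gender = 0
m_int, m_slope = fit_line(centered, men)        # line for Gender = 1

# Translate the two separate lines into interaction-model coefficients.
b_intercept = w_int                  # women's predicted time in 1976
b_year = w_slope                     # women's change per year
b_gender = m_int - w_int             # men minus women in 1976
b_interaction = m_slope - w_slope    # men's slope minus women's slope


def predict(year, gender):
    """Predicted time from the interaction model (gender: 0 female, 1 male)."""
    c = year - 1976
    return b_intercept + b_year * c + b_gender * gender + b_interaction * c * gender


# R^2 = 1 - SSE/SST over all 30 observations.
observed = women + men
fitted = [predict(yr, 0) for yr in years] + [predict(yr, 1) for yr in years]
grand_mean = sum(observed) / len(observed)
sse = sum((o - f) ** 2 for o, f in zip(observed, fitted))
sst = sum((o - grand_mean) ** 2 for o in observed)
r2 = 1 - sse / sst

print(f"{b_intercept:.3f} {b_year:.4f} {b_gender:.3f} {b_interaction:.4f}")
print(f"R^2 = {r2:.3f}")
print(f"2008 women: {predict(2008, 0):.2f}, 2008 men: {predict(2008, 1):.2f}")
```

This shortcut only works because the model has a gender-specific intercept and slope. The t and F statistics in parts d) and e) still require the standard errors from a full regression fit.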
3. Fit a multiple linear regression with time as the response and (year – 1976), Gender,
Gender*(year – 1976), (year – 1976)², and Gender*(year – 1976)².
a) How much of the variability in time is explained by this model?
The value of R² is 0.969 so 96.9% of the variation in time is explained by this model.
b) Does Gender*(Year – 1976)² add significantly to the model? Support your answer
statistically.
No. The test statistic is either t = –1.88 or F = 3.53 with associated P-value of
0.0725. Because the P-value is not small (> 0.05) the Gender*(Year – 1976)² term does
not add significantly to the model with the other four variables in it.
c) What is the “best” model for predicting time? Give the prediction equation.
The “best” model includes the terms (Year – 1976), Gender, (Year – 1976)*Gender
and (Year – 1976)².
Predicted Time = 22.313 – 0.0442*(Year – 1976) – 2.4393*Gender +
0.0240*Gender*(Year – 1976) + 0.00091*(Year – 1976)².
d) Use this “best” model to predict the men’s and women’s 200 m dash times for the 2008
Beijing Olympics. How do these predictions compare to the actual times of 19.30 for
men and 21.74 for women?
Women: Predicted time = 21.83 seconds
Men: Predicted time = 20.16 seconds
The prediction for men is too high but the prediction for women is pretty close.
e) Analyze the residuals for the “best” model. What does this analysis indicate about the
conditions of equal standard deviations, identically and normally distributed errors?
The plots of residuals versus the explanatory variables, Year and Gender, show
relatively equal spreads so the condition of equal standard deviations is probably
met.
The box plot shows a potential outlier at around +0.75. This indicates that the
identically distributed error condition is probably not met.
The histogram is mounded to the left of zero and skewed right. With the exception
of the one large residual, the box plot is fairly symmetric. Most of the points on the
Normal Quantile Plot fall close to the diagonal Normal model line. The normally
distributed error condition could go either way. There is some evidence against it
but that evidence is not that strong.
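The residual analysis for the “best” model can be reproduced numerically. The following is a sketch in plain Python that fits the four-term model by solving the normal equations and then lists the residuals; the helper solve and all variable names are assumptions made for this illustration, and the lab's own plots came from statistical software rather than from this code.

```python
from statistics import stdev

# Gold medal 200 m times in seconds, 1948 through 2004, from the table above.
years = list(range(1948, 2005, 4))
women = [24.40, 23.70, 23.40, 24.00, 23.00, 22.50, 22.40, 22.37,
         22.03, 21.81, 21.34, 21.81, 22.12, 21.84, 22.05]
men = [21.10, 20.70, 20.60, 20.50, 20.30, 19.80, 20.00, 20.23,
       20.19, 19.80, 19.75, 20.01, 19.32, 20.09, 19.79]


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Design rows: intercept, (Year - 1976), Gender, interaction, (Year - 1976)^2.
rows, times, labels = [], [], []
for gender, series in ((0, women), (1, men)):
    for yr, t in zip(years, series):
        c = yr - 1976
        rows.append([1.0, c, gender, c * gender, c * c])
        times.append(t)
        labels.append((yr, "men" if gender else "women"))

# Normal equations: (X'X) beta = X'y.
k = len(rows[0])
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * t for r, t in zip(rows, times)) for i in range(k)]
coef = solve(xtx, xty)

fitted = [sum(b * v for b, v in zip(coef, r)) for r in rows]
residuals = [t - f for t, f in zip(times, fitted)]

# Locate the residual that shows up as the potential outlier in the box plot.
largest = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
print("coefficients:", [round(b, 5) for b in coef])
print("largest residual:", labels[largest], round(residuals[largest], 2))

# Residual spread by gender, for comparing against the equal-spread condition.
print("women's residual SD:", round(stdev(residuals[:15]), 3))
print("men's residual SD:", round(stdev(residuals[15:]), 3))
```

The largest residual belongs to the women's 1960 time, which matches the potential outlier near +0.75 seen in the box plot.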