Statistics 401 C
Final Exam
December 20, 2001
Name:
INSTRUCTIONS: Read the questions carefully and completely. Answer the questions and
show work in the space provided, not on extra sheets. Credit will not be given if work is not
shown. Turn in the exam and printouts at the end of the examination period.
1. [35 pts] The small winged fruit of maple trees are called samara. When a samara falls, it
spins to the ground and, if conditions are right, starts to grow into a maple tree. A forest
scientist studied the velocity (Y) with which samara fall. Below are summaries of data for
random samples of samara from two different maple trees.
Tree 1: n1 = 12, Ȳ1 = 1.23, s1 = 0.098
Tree 2: n2 = 12, Ȳ2 = 0.95, s2 = 0.084
(a) [11] Use these summaries to test the hypothesis that the mean velocities for samara
from the two trees are the same against the alternative that they are different.
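One way to set up the computation for (a) is a pooled-variance two-sample t statistic built from the summaries alone. The sketch below assumes equal population variances (the two sample standard deviations are similar); the function name `pooled_t` is an illustrative choice, not part of the exam materials.

```python
import math

def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    df = n1 + n2 - 2
    # Pooled variance: df-weighted average of the two sample variances
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    # Standard error of the difference in sample means
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df

# Summaries for Tree 1 and Tree 2 as given in the problem
t_stat, df = pooled_t(1.23, 0.098, 12, 0.95, 0.084, 12)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

The statistic is then compared with a two-sided critical value from a t-table at the reported degrees of freedom.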
The analysis in (a) is criticized because it fails to account for a covariate, the load of
each samara. The load is a quantity based on the size and weight. Refer to the JMP
output entitled samara.
(b) [7] Consider the full model that predicts Velocity based on Load, an indicator variable
(Ind=0 if Tree 1 and Ind=1 if Tree 2) and the interaction between Load and Ind. Is the
interaction term statistically significant? Support your answer.
(c) [5] What does the result in (b) indicate about the linear relationship between Velocity
and Load for the two trees?
(d) [5] Compute the adjusted means for the two trees. Note that the average load is 0.20425.
(e) [7] Based on the adjusted means, are the two trees significantly different? Support your
answer.
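The adjusted means in (d) can be sketched as predictions at the common average load. The sketch assumes a parallel-lines model Velocity = b0 + b_load·Load + b_ind·Ind (that is, the interaction term has been dropped); the coefficient values passed in below are placeholders for illustration only and are not taken from the JMP output.

```python
MEAN_LOAD = 0.20425  # average load, as stated in part (d)

def adjusted_means(b0, b_load, b_ind, mean_load=MEAN_LOAD):
    """Predicted velocity for each tree at the same (average) load."""
    tree1 = b0 + b_load * mean_load  # Ind = 0
    tree2 = tree1 + b_ind            # Ind = 1 shifts the intercept by b_ind
    return tree1, tree2

# Placeholder coefficients (hypothetical, NOT from the samara output)
example = adjusted_means(b0=0.5, b_load=3.0, b_ind=-0.1)
print(example)
```

Under this model the difference between the two adjusted means equals the coefficient on Ind, so the test in (e) can be read from that coefficient's t ratio.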
2. [35 pts] Marine biologists made measurements on the density of coral in the Great Barrier
Reef off the coast of Australia. They also measured the distance to shore (km). Below is a
plot of the data. You should also refer to the JMP outputs: coralden-Fit Y by X.
(a) [5] From the plot, describe the relationship between Density and Distance from shore.
(b) [10] Comment on the least squares fit of Density on Distance. In particular, is Distance,
by itself, a significant predictor of Density? How much of the variability in Density is
explained by the simple linear fit? Is there a pattern in the plot of residuals?
(c) [10] Comment on the least squares fit of Density on Distance and Distance². In particular,
does Distance² add significant explanatory power to the simple linear fit? How
much of the variability in Density is explained by the model? Is there a pattern in the
plot of residuals?
(d) [10] Comment on the least squares fit of Density on Distance, Distance² and Distance³.
In particular, does Distance³ add significant explanatory power to the quadratic fit?
How much of the variability in Density is explained by the model? Is there a pattern in
the plot of residuals?
3. [45 pts] Our population of interest is Major League Baseball players who played at least
one game in both the 1991 and 1992 seasons, excluding pitchers. A random sample of 80
players is taken from the population of interest. The 1992 salary is the response variable.
The explanatory variables relate to various performance measures. A list of the variables
appears below. Refer to the JMP output BBsalary. “Best” is defined as the highest R² with
all variables significant at the 5% level.
• Salary: 1992 Salary in thousands of dollars
• BA: Batting average
• OBP: On base percentage
• Runs: Number of runs scored
• Hits: Number of hits
• Doubles: Number of doubles
• Triples: Number of triples
• HRs: Number of home runs
• RBI: Number of runs batted in
• Walks: Number of walks
• SOs: Number of strike outs
• SBs: Number of stolen bases
• Errors: Number of errors
• FAElig: Indicator of Free Agent Eligibility (Yes=1, No=0)
• FA91/2: Indicator of Free Agent 1991/92 (Yes=1, No=0)
• ArbElig: Indicator of Arbitration Eligibility (Yes=1, No=0)
• Arb91/2: Indicator of Arbitration 1991/92 (Yes=1, No=0)
• Name: Player’s name
(a) [6] Which of the variables, if used by itself in a simple linear regression, would provide
the highest predictive power? What is the value of R² for this simple linear regression?
Is the simple linear regression using this variable statistically significant?
(b) [6] Using the Forward selection procedure, what variables are in the final model? Give
the R², adj R² and Cp values for this final model. Could this be the “Best” model?
Explain briefly.
(c) [6] Using the Backward selection procedure, what variables are in the final model? Give
the R², adj R² and Cp values for this final model. Could this be the “Best” model?
Explain briefly.
(d) [6] Using the Mixed selection procedure, what variables are in the final model? Give the
R², adj R² and Cp values for this final model. Could this be the “Best” model? Explain
briefly.
(e) [5] Below are models with the highest R² for various numbers of variables.

Number in
Model      R²      adj R²  Cp       Variables in Model
2          0.6275  0.6178  30.9999  RBI FAElig
3          0.6675  0.6647  18.9151  RBI FAElig Arb91/2
4          0.6907  0.6742  17.1737  RBI FAElig ArbElig Arb91/2
5          0.7095  0.6898  13.8957  Runs HRs Walks FAElig Arb91/2
6          0.7319  0.7099   9.5622  Runs Hits RBI Walks FAElig ArbElig
7          0.7489  0.7245   6.7673  Runs Hits RBI Walks FAElig ArbElig Arb91/2
8          0.7613  0.7345   5.2718  Runs Hits RBI Walks FAElig FA91/2 ArbElig Arb91/2
9          0.7670  0.7370   5.6915  Runs Hits Triples RBI Walks FAElig FA91/2 ArbElig Arb91/2
10         0.7693  0.7358   7.0334  Runs Hits Triples RBI Walks SOs FAElig FA91/2 ArbElig Arb91/2
• Does Forward selection find the 5 variable model with the highest R²?
• Does Backward selection find the 9 variable model with the highest R²?
• Does Mixed selection find the 9 variable model with the highest R²?
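As a reading aid for the table in (e), the adjusted R² column can be related to the R² column through the usual formula 1 − (1 − R²)(n − 1)/(n − p − 1). The sketch below assumes n = 80 players and that p counts predictors excluding the intercept; the function name is illustrative.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

N_PLAYERS = 80  # sample size stated in Problem 3

# R-squared values copied from the 2-variable and 7-variable rows of the table
for p, r2 in [(2, 0.6275), (7, 0.7489)]:
    print(p, round(adjusted_r2(r2, N_PLAYERS, p), 4))
```

The printed values should land close to the tabulated adjusted R² entries for those rows, which is a quick check that a row has been read correctly.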
A “Best” model is found using 7 variables. The MSError is 431384.83. The analysis of
residuals from this model appears in the JMP output BBSalary-resid.
(f) [4] Lance Parish was paid 109 thousand dollars. The “Best” model predicts he would
be paid 1710.6 thousand dollars. What is the residual for Lance Parish? What is the
standardized residual?
(g) [6] What is the value of the most extreme studentized residual? Is this value significantly
different from zero? Use an overall level of 0.08 and adjust for the fact that you could
do 80 tests. Report the degrees of freedom you should use. If this number is not in your
t-table, use the t value for the closest degrees of freedom.
(h) [6] What is the value of the most extreme h value? Is this value significantly different
from zero? Use an overall level of 0.08 and adjust for the fact that you could do 80 tests.
Report the degrees of freedom you should use. If these are not in your F-table, use
the F value for the closest degrees of freedom.
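The arithmetic behind (f) and the multiple-testing adjustment in (g) and (h) can be sketched as follows. Two assumptions are made here: the standardized residual is taken as the residual divided by the square root of MSError, and the adjustment is a Bonferroni split of the overall level across the 80 possible tests. Course conventions may differ (for example, a case-deleted studentized residual would use one fewer degree of freedom than the model's error degrees of freedom).

```python
import math

N_PLAYERS = 80      # sample size from Problem 3
N_PREDICTORS = 7    # variables in the "Best" model
MSE = 431384.83     # MSError given in the problem

# (f) residual = observed - predicted, in thousands of dollars
residual = 109 - 1710.6
# Assumed definition: residual scaled by the root mean square error
standardized = residual / math.sqrt(MSE)

# (g), (h) Bonferroni: split the overall level across all possible tests
overall_alpha = 0.08
per_test_alpha = overall_alpha / N_PLAYERS

# Error degrees of freedom of the fitted model: n - p - 1
error_df = N_PLAYERS - N_PREDICTORS - 1

print(f"residual = {residual:.1f}, standardized = {standardized:.2f}")
print(f"per-test level = {per_test_alpha:.4f}, error df = {error_df}")
```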
4. [10] In class we looked at the relationship between iris color (Blue, Brown and Green) and
the critical flicker frequency (cff) using the ANOVA. Another way to analyze these data is
with two dummy variables:
• X1 = 1 if iris color is Blue, X1 = 0 otherwise
• X2 = 1 if iris color is Brown, X2 = 0 otherwise
Refer to the JMP output entitled Eyecff-Fit Least Squares.
(a) [6] Give the prediction equation and an interpretation, within the context of the problem, of each of the estimated coefficients.
(b) [4] According to this analysis, are Blue eyes different from Green eyes? Brown eyes from
Green eyes? Support your answers and use a 0.05 level.
(c) [5 Extra Credit] Is there a significant difference between Blue and Brown eyes? You
must support your answer by reasoning from the information in (b).
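For the dummy-variable coding in Problem 4, a short sketch may help show how the coefficients relate to group means. With only X1 and X2 in the model, the least squares fitted values equal the three group means, so Green (the level with X1 = X2 = 0) is the baseline. The group means used below are made-up placeholders, not values from the Eyecff output.

```python
def dummy_coefficients(mean_blue, mean_brown, mean_green):
    """Coefficients implied by group means when Green is the baseline level."""
    return {
        "intercept": mean_green,        # predicted cff for Green (X1 = X2 = 0)
        "X1": mean_blue - mean_green,   # Blue minus Green
        "X2": mean_brown - mean_green,  # Brown minus Green
    }

def predict(coefs, x1, x2):
    """Prediction equation: intercept + b1*X1 + b2*X2."""
    return coefs["intercept"] + coefs["X1"] * x1 + coefs["X2"] * x2

# Hypothetical group means, for illustration only
coefs = dummy_coefficients(mean_blue=28.0, mean_brown=25.0, mean_green=27.0)
print(coefs, predict(coefs, 1, 0), predict(coefs, 0, 1))
```

Under this coding the Blue versus Brown difference is the X1 coefficient minus the X2 coefficient, which is the quantity the extra-credit part asks about.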