Stat 301 – Final Exam Name: ________________________ December 20, 2013

INSTRUCTIONS: Read the questions carefully and completely. Answer each question
and show work in the space provided. Partial credit will not be given if work is not
shown. Use the JMP output. It is not necessary to calculate something by hand that JMP
has already calculated for you. When asked to explain, describe, or comment, do so
within the context of the problem. Be sure to include units when discussing quantitative
variables.
1. [14 pts] This question deals with the three model selection procedures (Forward,
Backward and Mixed) in general.
a) [3] Explain what the Prob to Enter is.
b) [4] For the Forward selection procedure, what is the first variable that will enter
the model?
c) [3] Explain what the Prob to Leave is.
d) [4] For the Backward selection procedure, what is the first variable that will be
removed from the model?
2. [26 pts] On the first two labs this semester we looked at the concentration of Non-Structural Carbohydrates (NSC in mg/g) for trees and shrubs in dry and moist tropical
forests. Researchers are interested in using the NSC concentrations to predict the
concentration of sugar (in mg/g) for those tropical trees and shrubs. For the multiple
regression model the centered value of NSC, NSC – 65.885, is used. The Forest
Indicator is 0 if it is a Moist tropical forest and 1 if it is a Dry tropical forest.
Summary of Fit
RSquare                       0.643
RSquare Adj                   0.630
Root Mean Square Error        9.216
Mean of Response              29.161
Observations (or Sum Wgts)    87

Analysis of Variance
Source      DF    Sum of Squares    Mean Square    F Ratio    Prob > F
Model       3     12684.881         4228.29        49.79      <.0001*
Error       83    7048.866          84.93
C. Total    86    19733.747

Parameter Estimates
Term                               Estimate    Std Error    t Ratio    Prob>|t|
Intercept                          28.296      1.3365       21.17      <.0001*
(NSC – 65.885)                     0.283       0.0328       8.63       <.0001*
Forest Indicator                   4.245       2.0832       2.04       0.0448*
Forest Indicator*(NSC – 65.885)    0.250       0.0703       3.55       0.0006*
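The entries in the JMP output above are tied together by the usual regression identities, so the printed values can be cross-checked from the sums of squares and degrees of freedom alone. The following is a minimal sketch of that check in Python; the variable names are illustrative and are not part of the JMP output.

```python
import math

# Values copied from the Analysis of Variance table above.
ss_model, df_model = 12684.881, 3
ss_error, df_error = 7048.866, 83
ss_total = 19733.747

# Mean squares are sums of squares divided by their degrees of freedom.
ms_model = ss_model / df_model
ms_error = ss_error / df_error

# F Ratio compares the model mean square with the error mean square.
f_ratio = ms_model / ms_error

# RSquare is the share of total variation attributed to the model.
r_square = ss_model / ss_total

# Root Mean Square Error is the square root of the error mean square.
rmse = math.sqrt(ms_error)

# RSquare Adj penalizes for the number of terms; n - 1 is the C. Total DF.
n = df_model + df_error + 1
r_square_adj = 1 - ms_error / (ss_total / (n - 1))

print(round(ms_model, 2), round(ms_error, 2), round(f_ratio, 2))
print(round(r_square, 3), round(r_square_adj, 3), round(rmse, 3))
```

Each printed quantity should agree with the corresponding JMP entry up to rounding.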
a) [2] How much variation in sugar concentration can be explained by the model
with (NSC – 65.885), Forest Indicator and the interaction between these two
variables?
b) [4] Is the model useful? Support your answer with the value of the appropriate
test statistic, P-value and conclusion based on the P-value.
c) [4] Give an interpretation of the estimated intercept within the context of the
problem.
d) [4] Give an interpretation of the parameter estimate for (NSC – 65.885) within the
context of the problem.
e) [4] Give an interpretation of the parameter estimate for the Forest Indicator
variable within the context of the problem.
f) [4] Because the P-value for the parameter estimate for Forest Indicator*(NSC –
65.885) is so small, the Forest Indicator*(NSC – 65.885) term is statistically
significant. What does this indicate about the relationship between sugar
concentration and NSC concentration?
g) [4] Is there a statistically significant difference in average sugar concentrations for
tropical trees and shrubs with 65.885 mg/g NSC in Moist and Dry tropical
forests? Support your answer with the value of the appropriate test statistic, P-value and conclusion based on the P-value.
3. [37 pts] Standardized residuals, leverage (h) values, Cook’s D, and Studentized
residuals are calculated for each of the n = 87 species using the fitted model in
problem 2.
a) [5] The distribution of standardized residuals is given below. What does this
indicate about the normal distribution condition? Be sure to make specific
reference to the plots to support your answer.
[Plots: distribution of Standardized Residuals]
b) [3] The plot of standardized residuals versus the type of forest is given below.
What does this plot tell you about the conditions necessary for statistical
inference?
Level    n     Mean    Std Dev
Dry      38    0       0.974
Moist    49    0       0.999
c) [5] The four most extreme standardized residuals are given below. Which are
statistically significant? Explain briefly.
Species                 Forest    Sugar    NSC    Std Resid, z    P-value, z
Ocotea sp. 1            Moist     52       230    –2.465          0.0137
Pseudolmedia laevis     Moist     48       69     2.042           0.0411
Zeyheria tuberculosa    Dry       81       92     3.749           0.0002
Pouteria macrophylla    Moist     57       103    1.975           0.0482
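The "P-value, z" column above appears consistent with a two-sided P-value from the standard normal distribution. The sketch below reproduces the column under that assumption using only the Python standard library; the function name two_sided_normal_p is hypothetical and chosen for illustration.

```python
import math

def two_sided_normal_p(z):
    """Two-sided standard normal P-value, P(|Z| >= |z|).

    Uses the identity 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    Assumption: the table's P-values are two-sided normal P-values.
    """
    return math.erfc(abs(z) / math.sqrt(2))

# Standardized residuals copied from the table above.
std_resid = {
    "Ocotea sp. 1": -2.465,
    "Pseudolmedia laevis": 2.042,
    "Zeyheria tuberculosa": 3.749,
    "Pouteria macrophylla": 1.975,
}

for species, z in std_resid.items():
    print(species, round(two_sided_normal_p(z), 4))
```

Small differences in the last digit can arise because the residuals in the table are themselves rounded to three decimal places.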
d) [3] How large does the leverage, h, have to be to be considered high leverage?
The ten largest leverage values are given below. The average NSC for trees and shrubs
in Dry tropical forests is 56.8 mg/g while the average NSC for trees and shrubs in Moist
tropical forests is 72.9 mg/g. The average sugar concentration for Dry tropical forest
trees and shrubs is 27.7 mg/g while the average sugar concentration for Moist tropical
forest trees and shrubs is 30.3 mg/g.
Species                      Forest    Sugar    NSC    h Sugar    F        P-value, F
Ocotea sp. 1                 Moist     52       230    0.333      ____     0.0000
Ocotea sp. 2                 Moist     58       137    0.072      1.82     0.1505
Pourouma cecropiifolia       Moist     38       134    0.068      1.67     0.1806
Pouteria nemorosa            Moist     56       128    0.059      1.39     0.2512
Swietenia macrophylla        Moist     50       169    0.137      4.03     0.0099
Bougainvillea modesta        Dry       12       24     0.075      1.91     0.1339
Chrisyophyllum gonocarpon    Dry       59       129    0.264      9.48     0.0000
Neea cf. steimbachii         Dry       11       19     0.092      2.44     0.0704
Pouteria gardneriana         Dry       65       142    0.357      14.87    0.0000
Zeyheria tuberculosa         Dry       81       92     0.083      2.15     0.1002
e) [5] Calculate the F statistic for Ocotea sp. 1.
f) [4] What species of tropical trees and shrubs have statistically significant leverage
values? Explain briefly.
g) [3] What is the reason for the statistically significant leverage?
The 5 trees and shrubs with either the largest values of Cook’s D or the most extreme
Studentized residuals are given below.
Species                 Forest    Cook's D    Studentized Resid, t    P-value, t
Ocotea sp. 1            Moist     1.136       –3.02                   0.0034
Pouteria macrophylla    Moist     0.033       2.01                    0.0479
Pseudolmedia laevis     Moist     0.022       2.06                    0.0422
Pouteria gardneriana    Dry       0.166       –1.09                   0.2771
Zeyheria tuberculosa    Dry       0.346       3.91                    0.0002
h) [2] Which tree/shrub species has the largest value of Cook’s D? Give the name of
the tree/shrub species and the value of Cook’s D.
i) [2] Is this considered a highly influential value? Explain briefly.
j) [2] Which tree/shrub species has the most extreme Studentized residual? Give the
name of the tree/shrub species and the value of the Studentized residual.
k) [3] Does the tree/shrub species with the most extreme Studentized residual have
statistically significant influence? Support your answer.
4. [48 pts] A random sample of 100 houses was selected from all houses sold in Ames in
2009-2010. We are interested in building a model for Sales Price ($1000) based on
characteristics of the houses. There are 12 explanatory variables for each house.
Lot Area             Area of the lot in square feet
Quality              Index of quality 0 = low, 10 = high
Condition            Index of condition 0 = poor, 10 = excellent
Age                  Age of house at the time of sale
Basement Area        Area of the basement in square feet
First Floor Area     Area of the first floor in square feet
Second Floor Area    Area of the second floor in square feet
Baths                Number of baths, Note: a half bath = ½
Bedrooms             Number of bedrooms
Other Rooms          Number of other rooms
Fireplaces           Number of fireplaces
Garage Cars          Size of garage in terms of number of cars
Below is the output for the Forward selection procedure with Prob to Enter = 0.25.
The C. Total sum of squares is 423164.51.
a) [5] What is the best single variable model for predicting Sales Price? How do you
know this is the best single variable model? How much variability in Sales Price
does this single variable explain?
b) [5] What variable was added to the model on the fourth step of the forward
selection procedure? What was the P-value for this variable when it was entered?
How much additional variability in Sales Price does adding this variable explain,
given the variables entered in the first three steps?
c) [4] Could the 10 variable model (Lot Area, Quality, Condition, Age, Basement
Area, First Floor Area, Second Floor Area, Bedrooms, Other Rooms, and Garage
Cars) be the best model? Explain briefly.
Below is the initial set up for the Backward selection procedure with Prob to Leave =
0.10.
d) [5] What will be the first variable removed from the full model using the
backward selection procedure? Give the P-value associated with this action and
indicate how the value of R² will change and by how much when that variable is
removed.
e) [5] After the first two variables are removed by the Backward selection procedure
the Current Estimates are the same as those given at the end of the Forward
selection procedure (see page 8). What will happen at the next step of the
Backward procedure? Explain what will happen and why.
Running the Mixed procedure with Prob to Enter = Prob to Leave = 0.05 gives the
Current Estimates given below.
f) [5] Complete the analysis of variance table that corresponds to the model with the
terms checked as Entered (Lot Area, Quality, Condition, Age, Basement Area,
First Floor Area, Second Floor Area, Bedrooms and Garage Cars).
Source      df      Sum of Squares    Mean Square    F         Prob > F
Model       ____    ____________      ___________    ______    <0.0001
Error       ____    ____________      ___________
C. Total    99      423164.51
g) [4] Could the model with the terms checked as Entered be the best model?
Explain briefly.
The All Possible Models option in Stepwise was used to display the top 10 (according to
R²) models for 9, 10, and 11 variable models along with the full 12 variable model.
Below are summary values of various measures of how well the model fits the data. Use
this output when answering parts h) – l).
Model    # variables    RSquare    RMSE       AICc        BIC         Cp
1        9              0.8692     24.8034    940.4476    966.1045    9.9034
2        9              0.8634     25.3390    944.7207    970.3776    13.8284
3        9              0.8615     25.5224    946.1632    971.8201    15.1916
4        9              0.8609     25.5724    946.5546    972.2114    15.5649
5        9              0.8606     25.6011    946.7786    972.4355    15.7793
6        9              0.8595     25.6983    947.5363    973.1931    16.5077
7        9              0.8594     25.7101    947.6288    973.2856    16.5970
8        9              0.8592     25.7337    947.8120    973.4689    16.7741
9        9              0.8591     25.7365    947.8337    973.4906    16.7952
10       9              0.8586     25.7812    948.1805    973.8374    17.1314
11       10             0.8727     24.6001    940.2703    967.9461    9.4529
12       10             0.8695     24.9061    942.7430    970.4188    11.6423
13       10             0.8692     24.9419    943.0307    970.7066    11.9007
14       10             0.8650     25.3368    946.1721    973.8479    14.7696
15       10             0.8635     25.4781    947.2843    974.9602    15.8072
16       10             0.8629     25.5298    947.6897    975.3655    16.1882
17       10             0.8629     25.5318    947.7058    975.3817    16.2034
18       10             0.8626     25.5616    947.9384    975.6142    16.4227
19       10             0.8619     25.6204    948.3985    976.0743    16.8581
20       10             0.8617     25.6400    948.5516    976.2274    17.0035
21       11             0.8733     24.6840    942.4679    972.1025    11.0614
22       11             0.8729     24.7188    942.7496    972.3843    11.3070
23       11             0.8695     25.0465    945.3840    975.0187    13.6376
24       11             0.8650     25.4803    948.8178    978.4525    16.7690
25       11             0.8634     25.6304    949.9927    979.6274    17.8654
26       11             0.8630     25.6682    950.2874    979.9221    18.1424
27       11             0.8627     25.6986    950.5242    980.1588    18.3656
28       11             0.8620     25.7572    950.9796    980.6142    18.7963
29       11             0.8614     25.8161    951.4369    981.0716    19.2308
30       11             0.8533     26.5645    957.1521    986.7867    24.8319
31       12             0.8734     24.8167    945.1060    976.6372    13.0000
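The RSquare and RMSE columns above are linked through the error sum of squares: SSE = (1 – R²) × C. Total SS, and RMSE = sqrt(SSE / (n – k – 1)) for a model with k variables. The sketch below checks a few rows against that relationship, taking n = 100 houses and the C. Total sum of squares of 423164.51 from the problem statement; the function name rmse_from_r_square is hypothetical.

```python
import math

N_HOUSES = 100          # sample size stated in the problem
SS_TOTAL = 423164.51    # C. Total sum of squares stated in the problem

def rmse_from_r_square(r_square, n_variables, n=N_HOUSES, ss_total=SS_TOTAL):
    """Recover RMSE from RSquare for a model with n_variables predictors."""
    ss_error = (1.0 - r_square) * ss_total   # unexplained variation
    df_error = n - n_variables - 1           # error degrees of freedom
    return math.sqrt(ss_error / df_error)

# (model number, # variables, RSquare, reported RMSE) copied from the table.
rows = [
    (1, 9, 0.8692, 24.8034),
    (11, 10, 0.8727, 24.6001),
    (31, 12, 0.8734, 24.8167),
]

for model, k, r_square, reported in rows:
    print(model, k, round(rmse_from_r_square(r_square, k), 2), reported)
```

Because RSquare is rounded to four decimal places in the table, the recovered RMSE values should match the reported ones only to about two decimal places.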
h) [2] Which model has the best R² value? Identify by giving the model number,
number of variables and value of R².
i) [2] Which model has the best RMSE value? Identify by giving the model
number, number of variables and value of RMSE.
j) [2] Which model has the best Cp value? Identify by giving the model number,
number of variables and value of Cp.
k) [2] Which model has the best AICc value? Identify by giving the model number,
number of variables and value of AICc.
l) [2] Which model has the best BIC value? Identify by giving the model number,
number of variables and value of BIC.
m) [5] Model 1 includes variables Lot Area, Quality, Condition, Age, Basement
Area, First Floor Area, Second Floor Area, Bedrooms and Garage Cars. Model 11
includes variables Lot Area, Quality, Condition, Age, Basement Area, First Floor
Area, Second Floor Area, Bedrooms, Other Rooms and Garage Cars. Between
these two models, which is best according to the definition of best we have been
using in this course? Explain briefly.