An old midterm exam, in compressed format

Midterm Exam
Stat 530
Spring 2007
Name ______________________________
Exam Rules: This exam is closed book, closed notes. However, you are permitted to have one
cheat sheet (regular sized paper, with writing on both sides). Use of a calculator is recommended.
When answering the problems below, make your methods clear. You may make use of any of the
Minitab output that you see on the exam. You do not need to redo calculations if they appear on the
output. There are a total of 40 points on the exam. When performing a hypothesis test, use a level
of α = 0.05. If using a p-value to make the decision for a test, reject if the p-value equals α. If in
need of a t-multiplier or critical value, use 2, but clearly state the degrees of freedom. If in need of
an F critical value, use 3, but clearly state the degrees of freedom for the test.
1. Data were collected on economic expenditures and several other variables in 48 U.S. states (Alaska and
Hawaii are excluded). The data were analyzed in an attempt to construct a model describing expenditures
(Expen) as a function of
Ecab - economic ability index, created from a variety of other variables
Met - percentage of population living in metropolitan areas
Met^2 - Met squared
Grow - percent change in population, 1950 - 1960
Young - percent of population aged 5 - 19 years
Old - percent of population aged more than 65 years
West - an indicator of western states
[Figure: Matrix Plot of Expen, Ecab, Met (pairwise scatterplots of the three variables)]
a. (3 pts) Assume that we will fit the model Expen = Ecab. Circle the case with the highest leverage on
the plot of Expen vs. Ecab. Circle the same point on the plots of Expen vs. Met and Ecab vs. Met.
b. (2 pts) These data were run through a stepwise selection procedure (see the output on the Data Sheet).
Describe the final model resulting from the stepwise procedure in short-hand notation.
c. (3 pts) If you were to remove one variable from the model in part b, which variable would you remove?
Why? Make your decision on the basis of the limited output available to you.
d. (2 pts) The plot below is a normal probability plot of the standardized residuals after fitting a
particular regression model to these data. Do the residuals appear to be normally distributed? If not,
briefly describe the departure(s) from normality.
[Figure: Probability Plot of SRES1, Normal - 95% CI. Vertical axis: Percent; horizontal axis: SRES1.
Legend: Mean = -0.01778, StDev = 1.058, N = 48, AD = 0.269, P-Value = 0.666]
2. Forbes collected data relating the boiling point of water to the logarithm of barometric pressure.
Another investigator, Hooker, did the same. In this problem, we seek to create a model that
encompasses both investigators' data. Forbes' apparent outlier was removed before running the data
through Minitab. The variables that are under consideration are:
Temp - the boiling point of water, measured in degrees Fahrenheit
Temp^2 - the square of Temp
LPres - 100 times the logarithm (base 10) of barometric pressure, in inches of mercury
Hooker - 0 if the case was collected by Forbes, 1 if collected by Hooker
H * Temp - the product of Hooker and Temp
The Data Sheet contains a variety of Minitab output that will be useful for this problem. Some of the
output has been trimmed--for example, the sequential sums of squares and unusual observations portion
of the output has been deleted from most of the regressions.
a. (5 pts) Is the model of separate regression lines for Hooker and Forbes useful for predicting LPres?
Perform a formal F-test for model utility at the 0.05 level. Clearly state your null and alternative
hypotheses, state the degrees of freedom for the test, and reach a conclusion.
b. (7 pts) You have output for several models on the Data Sheet. Among the models which can be
described as "a single regression line", "parallel regression lines" for the two investigators, and "separate
regression lines" for the two investigators, which seems to be most appropriate?
Base your choice on a set of hypothesis tests. Make your method clear.
c. (3 pts) Do Forbes' and Hooker's data appear to follow a single multiple linear regression model?
Briefly comment on the assumptions of linearity and constant variance. If they do not follow a single
model, how would you modify the model to better fit the data?
d. (5 pts) Perform a formal F-test for lack of fit of the model LPres = Temp. Clearly state your null and
alternative hypotheses, and make your method clear.
e. (3 pts) Use the Minitab output on the Data Sheet to compare the model LPres = Temp and the model
LPres = Temp + Temp^2. Do the data suggest the need for a quadratic term in the model? If your answer
to this part seems to differ from that in part d, how do you reconcile your two answers?
f. (5 pts) Using the model LPres = Temp, form a 95% confidence interval for the median barometric
pressure when the boiling point of water is 186.0 degrees. Note: Case 21 has a boiling point of water of
186.0 degrees. Note: This question is about barometric pressure, not 100 * log(base 10) barometric
pressure!
g. (2 pts) It is possible that Hooker's and Forbes' thermometers were calibrated differently. Suppose that
miscalibration, if it exists, is additive. For example, Forbes' thermometer might always measure 0.2
degrees warmer than Hooker's thermometer. Translate this notion of miscalibration into a formal
hypothesis that would generalize the model LPres = Temp. State your hypothesis in terms of a well-defined
parameter. Use the output from the Data Sheet to test your hypothesis.
DATA SHEET
Problem 1.
Stepwise Regression: Expen versus Ecab, Met, ...
Alpha-to-Enter: 0.15
Alpha-to-Remove: 0.15
Response is Expen on 7 predictors, with N = 48
Step                 1        2        3
Constant         119.1    143.3    174.3

Ecab              1.40     1.37     1.41
T-Value           3.65     4.80     4.98
P-Value          0.001    0.000    0.000

Met              -3.04    -3.04    -3.00
T-Value          -4.01    -4.06    -4.01
P-Value          0.000    0.000    0.000

Grow              0.70     0.69     0.52
T-Value           1.83     1.90     1.60
P-Value          0.074    0.064    0.116

Young              0.6
T-Value           0.09
P-Value          0.931

Old                4.1      3.6
T-Value           0.63     1.04
P-Value          0.534    0.304

West                34       34       35
T-Value           2.78     3.08     3.12
P-Value          0.008    0.004    0.003

Met^2           0.0309   0.0307   0.0305
T-Value           3.45     3.64     3.62
P-Value          0.001    0.001    0.001

S                 35.4     35.0     35.0
R-Sq             69.13    69.12    68.31
R-Sq(adj)        63.73    64.61    64.54
Mallows C-p        8.0      6.0      5.1
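The summary rows of the stepwise output can be cross-checked against each other. The following is a minimal Python sketch, assuming the usual adjusted R-squared formula 1 - (1 - R^2)(n - 1)/(n - p - 1) with n = 48. The predictor counts p = 7, 6, 5 for steps 1, 2, 3 are an inference from the output (the Young and Old entries stop after steps 1 and 2), and the function name is illustrative, not part of Minitab.

```python
def adjusted_r_squared(r_sq_percent, n, n_predictors):
    """Adjusted R-squared (in percent) from R-squared (in percent)."""
    r_sq = r_sq_percent / 100
    adjusted = 1 - (1 - r_sq) * (n - 1) / (n - n_predictors - 1)
    return 100 * adjusted

n = 48
# (printed R-Sq, assumed number of predictors, printed R-Sq(adj)) for each step
steps = [(69.13, 7, 63.73), (69.12, 6, 64.61), (68.31, 5, 64.54)]

recomputed = [adjusted_r_squared(r_sq, n, p) for r_sq, p, _ in steps]

for (r_sq, p, printed), calc in zip(steps, recomputed):
    print(f"p = {p}: printed {printed:.2f}, recomputed {calc:.2f}")
```

Small differences in the second decimal are expected, because the printed R-Sq values are themselves rounded.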
Problem 2.
Descriptive Statistics: LPres
Variable  Hooker   N  N*    Mean  SE Mean  StDev  Minimum      Q1  Median      Q3  Maximum
LPres     0       16   1  139.43     1.32   5.29   131.79  135.77  138.02  145.19   147.80
          1       31   0  129.44     1.42   7.93   118.68  122.74  128.75  134.10   146.55

Regression Analysis: LPres versus Temp
The regression equation is
LPres = - 43.8 + 0.903 Temp

Predictor      Coef   SE Coef       T      P
Constant   -43.8105    0.9251  -47.36  0.000
Temp       0.903329  0.004725  191.18  0.000

S = 0.302950   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Analysis of Variance
Source          DF      SS      MS         F      P
Regression       1  3354.4  3354.4  36548.96  0.000
Residual Error  45     4.1     0.1
Total           46  3358.6

Unusual Observations
Obs  Temp    LPres      Fit  SE Fit  Residual  St Resid
 21   186  123.606  124.209   0.063    -0.603    -2.03R
 22   186  123.203  123.847   0.065    -0.644    -2.18R
 31   181  118.684  119.331   0.083    -0.646    -2.22R

R denotes an observation with a large standardized residual.
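For the interval requested in part 2f, one possible route is to build the interval for mean LPres at case 21 from its Fit and SE Fit, then back-transform the endpoints. The following is a minimal Python sketch of that route, not an official solution. It assumes the exam's working t-multiplier of 2 (45 degrees of freedom), and it assumes errors are symmetric on the LPres scale, so that the back-transformed endpoints describe the median rather than the mean pressure. The function name is illustrative.

```python
# Values read from the Unusual Observations table for case 21 (Temp = 186).
fit = 124.209      # fitted LPres, i.e. 100 * log10(pressure)
se_fit = 0.063     # standard error of the fitted value
multiplier = 2.0   # exam convention, standing in for the exact t quantile

def median_pressure_interval(fit, se_fit, multiplier):
    """Interval for median pressure, back-transformed from the LPres scale."""
    lower_lpres = fit - multiplier * se_fit
    upper_lpres = fit + multiplier * se_fit
    # LPres = 100 * log10(pressure), so pressure = 10 ** (LPres / 100).
    return 10 ** (lower_lpres / 100), 10 ** (upper_lpres / 100)

low, high = median_pressure_interval(fit, se_fit, multiplier)
print(f"{low:.2f} to {high:.2f} inches of mercury")
```

The back-transformation is applied to the endpoints only, since 10 ** x is increasing and therefore preserves the order of the interval limits.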
Regression Analysis: LPres versus Temp, Hooker
The regression equation is
LPres = - 43.8 + 0.904 Temp + 0.006 Hooker
Predictor      Coef   SE Coef       T      P
Constant    -43.849     1.173  -37.38  0.000
Temp       0.903506  0.005770  156.59  0.000
Hooker       0.0063    0.1139    0.05  0.956

S = 0.306363   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Analysis of Variance
Source          DF      SS      MS         F      P
Regression       2  3354.4  1677.2  17869.60  0.000
Residual Error  44     4.1     0.1
Total           46  3358.6
Regression Analysis: LPres versus Temp, H * Temp
The regression equation is
LPres = - 43.9 + 0.904 Temp + 0.000053 H * Temp
Predictor       Coef    SE Coef       T      P
Constant     -43.868      1.119  -39.21  0.000
Temp        0.903590   0.005523  163.59  0.000
H * Temp   0.0000535  0.0005667    0.09  0.925

S = 0.306343   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Analysis of Variance
Source          DF      SS      MS         F      P
Regression       2  3354.4  1677.2  17872.00  0.000
Residual Error  44     4.1     0.1
Total           46  3358.6
Regression Analysis: LPres versus Temp, Hooker, H * Temp
The regression equation is
LPres = - 41.3 + 0.891 Temp - 3.06 Hooker + 0.0153 H * Temp
Predictor     Coef  SE Coef       T      P
Constant   -41.335    2.704  -15.29  0.000
Temp       0.89111  0.01332   66.88  0.000
Hooker      -3.056    2.970   -1.03  0.309
H * Temp   0.01525  0.01478    1.03  0.308

S = 0.306137   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Analysis of Variance
Source          DF      SS      MS         F      P
Regression       3  3354.5  1118.2  11931.04  0.000
Residual Error  43     4.0     0.1
Total           46  3358.6
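The comparison among a single line, parallel lines, and separate lines (part 2b) can be framed as partial F-tests between nested models. The following is a minimal Python sketch of that arithmetic, not an official solution. Because the printed residual sums of squares are rounded to one decimal, it assumes the residual SS can be recovered as S^2 times the error degrees of freedom. The helper names are illustrative.

```python
def sse_from_s(s, df_error):
    """Recover the residual sum of squares from Minitab's S and its error df."""
    return s ** 2 * df_error

def partial_f(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic for a reduced model nested inside a full model."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator

# S and error degrees of freedom as printed on the Data Sheet.
sse_single = sse_from_s(0.302950, 45)     # LPres = Temp
sse_parallel = sse_from_s(0.306363, 44)   # LPres = Temp + Hooker
sse_separate = sse_from_s(0.306137, 43)   # LPres = Temp + Hooker + H * Temp

# Single line vs. separate lines: numerator df = 2, denominator df = 43.
f_single_vs_separate = partial_f(sse_single, 45, sse_separate, 43)
# Parallel lines vs. separate lines: numerator df = 1, denominator df = 43.
f_parallel_vs_separate = partial_f(sse_parallel, 44, sse_separate, 43)

print(f"single vs separate:   F = {f_single_vs_separate:.2f} on (2, 43) df")
print(f"parallel vs separate: F = {f_parallel_vs_separate:.2f} on (1, 43) df")
```

As a sanity check, the second statistic should be close to the square of the T value printed for H * Temp in the separate-lines output (1.03), up to rounding.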
Regression Analysis: LPres versus Temp, Temp^2
The regression equation is
LPres = - 82.3 + 1.30 Temp - 0.00100 Temp^2
Predictor        Coef    SE Coef      T      P
Constant       -82.35      19.17  -4.29  0.000
Temp           1.2973     0.1959   6.62  0.000
Temp^2     -0.0010047  0.0004993  -2.01  0.050

S = 0.293182   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Analysis of Variance
Source          DF      SS      MS         F      P
Regression       2  3354.8  1677.4  19514.47  0.000
Residual Error  44     3.8     0.1
Total           46  3358.6
One-way ANOVA: LPres versus Temp
Source  DF        SS      MS       F      P
Temp    44  3358.369  76.327  833.50  0.001
Error    2     0.183   0.092
Total   46  3358.552

S = 0.3026   R-Sq = 99.99%   R-Sq(adj) = 99.87%
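The lack-of-fit test in part 2d combines this one-way ANOVA with the regression of LPres on Temp. The following is a minimal Python sketch of the arithmetic, not an official solution. It assumes that the Error line above (SS = 0.183 on 2 df) serves as pure error from repeated Temp values, and that the regression residual SS can be recovered as S^2 times 45, since the printed value (4.1) is rounded.

```python
# Residual SS for LPres = Temp, recovered from S = 0.302950 with 45 error df.
sse_linear = 0.302950 ** 2 * 45
df_linear = 45

# Pure error from the one-way ANOVA of LPres on Temp treated as a factor.
ss_pure_error = 0.183
df_pure_error = 2

# Lack of fit is what remains of the regression residual after pure error.
ss_lack_of_fit = sse_linear - ss_pure_error
df_lack_of_fit = df_linear - df_pure_error

f_lack_of_fit = (ss_lack_of_fit / df_lack_of_fit) / (ss_pure_error / df_pure_error)
print(f"F = {f_lack_of_fit:.2f} on ({df_lack_of_fit}, {df_pure_error}) df")
```

With only 2 degrees of freedom for pure error, the denominator is estimated from very little replication, which is worth keeping in mind when comparing this test with the Temp^2 result in part 2e.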
[Figure: Scatterplot of SRES7 vs FITS7, with points marked by Hooker (0 or 1). Vertical axis: SRES7; horizontal axis: FITS7.]
Standardized residuals vs. fits from the model LPres = Temp