252y0741 5/7/07
ECO252 QBA2
Final EXAM
May , 2007
Version 1
Name and Class hour:______KEY_______________
I. (18+ points) Do all the following. Note that answers without reasons and citation of appropriate
statistical tests receive no credit. Most answers require a statistical test, that is, stating or implying a
hypothesis and showing why it is true or false by citing a table value or a p-value. If you haven’t done it
lately, take a fast look at ECO 252 - Things That You Should Never Do on a Statistics Exam (or Anywhere
Else)
In his text Allen L. Webster presents the following data. The hope was to explain the three-year returns of the funds and to compare them to one-year returns. The data given are as follows.
MTB > print c1 c2 c3 c4 c5 c6
Data Display

Row  3YRR  1YRR  LOAD  3YRT  1YRT  ASSETS
  1   5.6   0.1     0   112    58  220.00
  2   4.7   1.9     1    95    62  158.00
  3   4.5   2.6     1   241    65  227.25
  4   4.8   2.0     1    87    61  242.40
  5   5.7   3.5     0    98    57  287.85
  6   4.1  -4.3     1   102    66  207.05
  7   4.7   3.2     1    72    63  237.35
  8   4.1  -4.1     1    96    65  207.05
  9   5.2   2.2     0    78    59  262.60
 10   3.7   2.1     1   118    87  186.85
 11   6.2   5.3     0    98    47  313.10
 12   6.6  11.0     0    87    41  333.30
 13   5.2   0.3     0   117    61  262.60
 14   5.5  -2.1     0    87    46  277.75
 15   5.6   4.7     0    85    35  282.00
In the above, '1YRR' is the rate of return over a 1-year period. 'LOAD' is a dummy variable that is 1 when there is a load and zero if it is a no-load fund. '1YRT' is a turnover rate for the fund and tells you the percent of the fund bought or sold during the year. 'ASSETS' is the assets in $billions when the fund opened. The remaining columns are for 3-year data and are not used here. I played with these data for a while and got very discouraging results for the 3-year data and nearly as discouraging results for the 1-year data, as you will see in Regression 3 below. I finally created the set of new independent variables that appear below.
MTB > print c6 c7 c9 c11 c13 c14 c15
Data Display

Row  ASSETS  ASSTsq  lnAssets  1YRTsq  AssetsL  1YRTL   ln1YRT
  1  220.00   48400   5.39363    3364     0.00      0  4.06044
  2  158.00   24964   5.06260    3844   158.00     62  4.12713
  3  227.25   51643   5.42605    4225   227.25     65  4.17439
  4  242.40   58758   5.49059    3721   242.40     61  4.11087
  5  287.85   82858   5.66244    3249     0.00      0  4.04305
  6  207.05   42870   5.33296    4356   207.05     66  4.18965
  7  237.35   56335   5.46954    3969   237.35     63  4.14313
  8  207.05   42870   5.33296    4225   207.05     65  4.17439
  9  262.60   68959   5.57063    3481     0.00      0  4.07754
 10  186.85   34913   5.23031    7569   186.85     87  4.46591
 11  313.10   98032   5.74652    2209     0.00      0  3.85015
 12  333.30  111089   5.80904    1681     0.00      0  3.71357
 13  262.60   68959   5.57063    3721     0.00      0  4.11087
 14  277.75   77145   5.62672    2116     0.00      0  3.82864
 15  282.00   79524   5.64191    1225     0.00      0  3.55535
'ASSTsq' is the square of the 'ASSETS' variable. 'lnAssets' is the natural logarithm of the 'ASSETS' variable. '1YRTsq' is the square of the one-year turnover. 'AssetsL' is an interaction variable, the product of 'ASSETS' and 'LOAD.' '1YRTL' is the product of '1YRT' and 'LOAD.' 'ln1YRT' is the natural logarithm of '1YRT.'
————— 5/1/2007 9:16:52 PM ————————————————————
Welcome to Minitab, press F1 for help.
Results for: 252x07041-01A.MTW
MTB > let c14 = c5*c3
MTB > Stepwise c2 c3 c5 c6 c7 c9 c11 c13 c14;
SUBC>   Backward;
SUBC>   ARemove 0.1;
SUBC>   Best 0;
SUBC>   Constant.
Regression 1
Stepwise Regression: 1YRR versus LOAD, 1YRT, ...
Backward elimination. Alpha-to-Remove: 0.1
Response is 1YRR on 8 predictors, with N = 15
Step             1        2        3
Constant      1485     1458     1099

LOAD            -2
T-Value      -0.04
P-Value      0.971

1YRT         -2.08    -2.12    -2.10
T-Value      -1.74    -3.65    -3.81
P-Value      0.132    0.008    0.005

ASSETS        1.80     1.75     0.96
T-Value       0.84     1.11     5.37
P-Value      0.432    0.303    0.001

ASSTsq     -0.0009  -0.0008
T-Value      -0.41    -0.50
P-Value      0.698    0.630

lnAssets      -332     -325     -234
T-Value      -1.23    -1.73    -5.03
P-Value      0.265    0.127    0.001

1YRTsq      0.0216   0.0221   0.0218
T-Value       1.74     3.81     3.98
P-Value      0.133    0.007    0.004

AssetsL      0.256    0.253    0.251
T-Value       2.23     3.73     3.90
P-Value      0.067    0.007    0.005

1YRTL        -0.92    -0.94    -0.94
T-Value      -1.26    -3.61    -3.79
P-Value      0.255    0.009    0.005

S             2.03     1.88     1.79
R-Sq         87.79    87.79    87.34
R-Sq(adj)    71.51    75.57    77.85
Mallows C-p    9.0      7.0      5.2

More? (Yes, No, Subcommand, or Help)
SUBC> y
No variables entered or removed
More? (Yes, No, Subcommand, or Help)
SUBC> n
1) Regression 1 is a reverse stepwise regression that I ran on the 'full' model to see if there were any obvious candidates for elimination. The third column is the 3rd regression that was run. What can you say about the significance of the coefficients of the variables that 'stepwise' forced out? Why? Do any of the variables left in have a suspicious sign? (2)
Solution: Both of the variables (‘LOAD’ and ‘ASSTsq’) had extremely high p-values (.971 and .630),
which means that their coefficients were not significant. Their lack of explanatory power is also brought out
by the fact that, while R-squared fell somewhat as variables were removed, R-squared adjusted actually
rose. We would expect most of the signs of the coefficients in the last column. There is nothing wrong
with a negative sign unless there is a good reason not to expect it. Generally a high turnover rate is
considered a bad reflection on investor’s opinion of a fund’s prospects, so it ought to have a negative sign.
We would expect the size of assets to have a positive effect on valuation so the positive sign of ‘ASSETS’
is reasonable. The negative sign of ‘lnAssets’ seems to indicate a nonlinear relationship between ‘1YRR’
and Assets which is offset by the positive sign of ‘Assets.’ The negative sign of ‘1YRTL’ is no surprise,
since it is made up of both ‘Load’ and ‘1YRT,’ both of which should depress the incentive of an investor to
buy the fund. It should be added that the extremely low value of C-p (5.2) relative to the number of
‘independent’ variables (6) and the fact that every coefficient has a p-value below 1% makes this equation
look good. In fact, the major liability of the equation is that analysts rarely use both a variable and its
logarithm in the same equation.
MTB > Regress c2 6 c5 c6 c9 c11 c13 c14;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.

Regression 2
Regression Analysis: 1YRR versus 1YRT, ASSETS, ...
The regression equation is
1YRR = 1099 - 2.10 1YRT + 0.962 ASSETS - 234 lnAssets + 0.0218 1YRTsq
       + 0.251 AssetsL - 0.945 1YRTL

Predictor       Coef    SE Coef      T      P    VIF
Constant      1098.8      220.7   4.98  0.001
1YRT         -2.1015     0.5518  -3.81  0.005  203.2
ASSETS        0.9623     0.1792   5.37  0.001  320.4
lnAssets     -233.89      46.50  -5.03  0.001  379.9
1YRTsq      0.021819   0.005488   3.98  0.004  289.5
AssetsL      0.25128    0.06438   3.90  0.005  217.9
1YRTL        -0.9446     0.2491  -3.79  0.005  332.7

S = 1.79353   R-Sq = 87.3%   R-Sq(adj) = 77.9%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       6  177.595  29.599  9.20  0.003
Residual Error   8   25.734   3.217
Total           14  203.329

Source    DF  Seq SS
1YRT       1  38.826
ASSETS     1  33.901
lnAssets   1  48.554
1YRTsq     1   4.363
AssetsL    1   5.702
1YRTL      1  46.249
2) Regression 2 is the regression suggested by the ‘stepwise’ command. What is most alarming about
it? (1)
Solution: The worst part of this is the high values of VIF, indicating that large amounts of collinearity are
present. Unless we know where this collinearity is coming from, we must be quite suspicious of the results.
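The variance inflation factor flagged here is easy to compute directly: VIF_j = 1/(1 − R²_j), where R²_j comes from regressing predictor j on all the other predictors. A small sketch with a deliberately collinear third column (the data and names below are made up for illustration):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (one predictor per column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # regress column j on the rest
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return out

x1 = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
x2 = np.array([2.0, 1, 4, 3, 6, 5, 8, 7])
# x3 is almost exactly x1 + x2, so every VIF explodes, as in Regression 2.
x3 = x1 + x2 + np.array([0.01, -0.01, 0.02, -0.02, 0.01, -0.01, 0.02, -0.02])
vifs = vif(np.column_stack([x1, x2, x3]))
```

A common rule of thumb treats VIFs above about 5 or 10 as serious; the values in the hundreds above dwarf that.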
MTB > Regress c2 3 c3 c5 c6;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.

Regression 3
Regression Analysis: 1YRR versus LOAD, 1YRT, ASSETS
The regression equation is
1YRR = - 13.1 + 1.76 LOAD - 0.011 1YRT + 0.0599 ASSETS
Predictor     Coef  SE Coef      T      P  VIF
Constant    -13.11    13.22  -0.99  0.343
LOAD         1.758    2.801   0.63  0.543  2.6
1YRT       -0.0106   0.1153  -0.09  0.929  2.5
ASSETS     0.05992  0.03328   1.80  0.099  3.1

S = 3.38563   R-Sq = 38.0%   R-Sq(adj) = 21.1%

Analysis of Variance
Source          DF      SS     MS     F      P
Regression       3   77.24  25.75  2.25  0.140
Residual Error  11  126.09  11.46
Total           14  203.33

Source  DF  Seq SS
LOAD     1   26.01
1YRT     1   14.07
ASSETS   1   37.16
3) Regression 3 is an attempt to look at the original independent variables. Only one part of the results is
encouraging. Comment on the coefficient of determination and the tests for significance and
(multi)collinearity. (2.5)
After the high apparent significance of the coefficients in Regression 2, the p-values here are a shock. All
are way above 10%, so none of the coefficients are significant. This is, of course, echoed in the high p-value
from the ANOVA and a pretty poor R-squared. At this point, if we had not seen regression 2, we would
suspect that our independent variables are complete duds.
However, the good news is the extremely low VIFs, indicating a total absence of collinearity.
MTB > Regress c2 4 c3 c5 c11 c6;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.

Regression 4
Regression Analysis: 1YRR versus LOAD, 1YRT, 1YRTsq, ASSETS
The regression equation is
1YRR = 4.0 + 1.81 LOAD - 0.564 1YRT + 0.00455 1YRTsq + 0.0558 ASSETS
Predictor      Coef   SE Coef      T      P   VIF
Constant       3.98     19.12   0.21  0.839
LOAD          1.815     2.743   0.66  0.523   2.6
1YRT        -0.5635    0.4689  -1.20  0.257  42.9
1YRTsq     0.004554  0.003748   1.22  0.252  39.5
ASSETS      0.05581   0.03275   1.70  0.119   3.1

S = 3.31461   R-Sq = 46.0%   R-Sq(adj) = 24.4%

Analysis of Variance
Source          DF      SS     MS     F      P
Regression       4   93.46  23.37  2.13  0.152
Residual Error  10  109.87  10.99
Total           14  203.33

Source  DF  Seq SS
LOAD     1   26.01
1YRT     1   14.07
1YRTsq   1   21.48
ASSETS   1   31.90
Unusual Observations
Obs  LOAD   1YRR     Fit  SE Fit  Residual  St Resid
  2  1.00  1.900  -2.820   2.381     4.720     2.05R
R denotes an observation with a large standardized residual.
4) Regression 4 is a first step in building the model. What can we say about R-squared and R-squared
adjusted compared to regression 3? (1)
[5.5]
Solution: I should have looked at graphs of the residuals, but old bad habits die hard and we already had
evidence that a nonlinear equation was the way to go. R-squared and R-squared adjusted both rose, which is
encouraging.
The p-values for ‘1YRT’ and ‘1YRTsq’ are, at least, much less laughable than the p-values for ‘1YRT’
alone and, because of the tiny coefficient of the squared term, still seem to indicate that rising turnover will
hurt market performance. The high VIFs we are seeing are not important, since we now can guess that they
are caused by the necessary relationship between ‘1YRT’ and ‘1YRTsq.’
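Fitting a variable together with its square is ordinary polynomial regression, and the x-versus-x² correlation that drives those harmless VIFs is easy to see directly. A small sketch on made-up data (an exact quadratic, so the fit recovers the coefficients):

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5, 6])
y = 0.5 * x**2 - 2 * x + 3        # made-up exact quadratic, no noise

coeffs = np.polyfit(x, y, deg=2)  # highest power first: [0.5, -2, 3]

# Over a narrow range, x and x**2 are strongly correlated, which is why a
# variable and its square always produce inflated VIFs, as in Regression 4.
r = np.corrcoef(x, x**2)[0, 1]
```

The high correlation between x and x² is structural, not a sign of redundant information, which is the point the solution makes about '1YRT' and '1YRTsq.'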
MTB > Regress c2 3 c3 c15 c6;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.

Regression 5
Regression Analysis: 1YRR versus LOAD, ln1YRT, ASSETS
The regression equation is
1YRR = - 3.2 + 1.96 LOAD - 2.35 ln1YRT + 0.0554 ASSETS
Predictor     Coef  SE Coef      T      P  VIF
Constant     -3.19    30.30  -0.11  0.918
LOAD         1.956    2.774   0.70  0.496  2.5
ln1YRT      -2.355    6.318  -0.37  0.716  2.4
ASSETS     0.05541  0.03307   1.68  0.122  3.1

S = 3.36573   R-Sq = 38.7%   R-Sq(adj) = 22.0%

Analysis of Variance
Source          DF      SS     MS     F      P
Regression       3   78.72  26.24  2.32  0.132
Residual Error  11  124.61  11.33
Total           14  203.33

Source  DF  Seq SS
LOAD     1   26.01
ln1YRT   1   20.92
ASSETS   1   31.79
5) Regression 5 investigates the possibility of replacing the two terms in ‘1YRT’ with its logarithm.
Why did I decide this was a bad idea? (1)
Solution: The only thing good about this regression is that 'ln1YRT' has the expected negative sign. R-squared and R-squared adjusted both fell, and the p-value for the coefficient of 'ln1YRT' is above 50%.
Identical to Regression 4
MTB > Regress c2 4 c3 c5 c11 c6;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.
(The output repeats Regression 4 above.)
MTB > Regress c2 5 c3 c5 c11 c6 c7;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.

Regression 5
Regression Analysis: 1YRR versus LOAD, 1YRT, 1YRTsq, ASSETS, ASSTsq
The regression equation is
1YRR = 30.7 + 1.39 LOAD - 0.259 1YRT + 0.00242 1YRTsq - 0.260 ASSETS
+ 0.000653 ASSTsq
Predictor       Coef    SE Coef      T      P   VIF
Constant       30.66      21.31   1.44  0.184
LOAD           1.387      2.408   0.58  0.579   2.6
1YRT         -0.2591     0.4368  -0.59  0.568  48.8
1YRTsq      0.002416   0.003444   0.70  0.501  43.6
ASSETS       -0.2596     0.1588  -1.63  0.137  96.3
ASSTsq     0.0006533  0.0003235   2.02  0.074  98.8

S = 2.89834   R-Sq = 62.8%   R-Sq(adj) = 42.2%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       5  127.726  25.545  3.04  0.070
Residual Error   9   75.604   8.400
Total           14  203.329

Source  DF  Seq SS
LOAD     1  26.006
1YRT     1  14.072
1YRTsq   1  21.482
ASSETS   1  31.904
ASSTsq   1  34.263
6) At long last we seem to be getting somewhere in Regression 5, though we may have to use a 10%
significance level to claim any accomplishment. What cheered me up? (1) [7.5]
Solution: The good news is that the p-value for the ANOVA just fell below 10%, which we can consider a minimum acceptable significance level. Both R-squared and R-squared adjusted went up. On the other hand, the negative sign of 'ASSETS' is disturbing, though it could be offset by the positive sign of 'ASSTsq.'
MTB > Regress c2 4 c3 c5 c11 c9;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.

Regression 6
Regression Analysis: 1YRR versus LOAD, 1YRT, 1YRTsq, lnAssets
The regression equation is
1YRR = - 34.3 + 1.32 LOAD - 0.642 1YRT + 0.00497 1YRTsq + 10.1 lnAssets
Predictor      Coef   SE Coef      T      P   VIF
Constant     -34.26     49.59  -0.69  0.505
LOAD          1.324     2.886   0.46  0.656   2.6
1YRT        -0.6421    0.4874  -1.32  0.217  41.9
1YRTsq     0.004971  0.003928   1.27  0.234  39.2
lnAssets     10.081     7.856   1.28  0.228   2.9

S = 3.48891   R-Sq = 40.1%   R-Sq(adj) = 16.2%

Analysis of Variance
Source          DF      SS     MS     F      P
Regression       4   81.60  20.40  1.68  0.231
Residual Error  10  121.73  12.17
Total           14  203.33

Source    DF  Seq SS
LOAD       1   26.01
1YRT       1   14.07
1YRTsq     1   21.48
lnAssets   1   20.04

Unusual Observations
Obs  LOAD   1YRR     Fit  SE Fit  Residual  St Resid
  2  1.00  1.900  -2.601   2.801     4.501     2.16R
R denotes an observation with a large standardized residual.
Identical to Regression 5
MTB > Regress c2 5 c3 c5 c11 c6 c7;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.
(The output repeats Regression 5 above.)
7) So why did I backtrack? (1) [8.5]
Solution: Again, trying to replace a quadratic equation for the relationship between Assets and the rate of
return with a logarithm lowered R-squared and R-squared adjusted and raised the p-values for the ANOVA
and the coefficients of assets.
MTB > Regress c2 6 c3 c5 c11 c6 c7 c13;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.

Regression 7
Regression Analysis: 1YRR versus LOAD, 1YRT, ...
The regression equation is
1YRR = 121 - 47.4 LOAD - 0.460 1YRT + 0.00424 1YRTsq - 0.859 ASSETS
+ 0.00170 ASSTsq + 0.204 AssetsL
Predictor       Coef    SE Coef      T      P    VIF
Constant      120.50      51.93   2.32  0.049
LOAD          -47.36      26.33  -1.80  0.110  392.1
1YRT         -0.4604     0.4021  -1.14  0.285   52.6
1YRTsq      0.004239   0.003207   1.32  0.223   48.1
ASSETS       -0.8591     0.3521  -2.44  0.041  603.0
ASSTsq     0.0017033  0.0006339   2.69  0.028  482.5
AssetsL       0.2043     0.1100   1.86  0.100  309.9

S = 2.56960   R-Sq = 74.0%   R-Sq(adj) = 54.5%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       6  150.507  25.084  3.80  0.043
Residual Error   8   52.823   6.603
Total           14  203.329

Source   DF  Seq SS
LOAD      1  26.006
1YRT      1  14.072
1YRTsq    1  21.482
ASSETS    1  31.904
ASSTsq    1  34.263
AssetsL   1  22.781

Unusual Observations
Obs  LOAD   1YRR     Fit  SE Fit  Residual  St Resid
  2  1.00  1.900  -0.053   2.391     1.953     2.08R
R denotes an observation with a large standardized residual.
8) a) Victory? What happened in the Regression 7 ANOVA that hasn’t occurred since Regression 2?
What still stinks? Why have I stopped worrying about VIFs? (3) [10.5]
Solution: We finally have a p-value for the ANOVA below 5%. On the other hand, the only coefficient that is significant is that of 'ASSTsq.' As I stated earlier, the cause of the high VIFs seems to be the fact that when one variable enters the equation in several forms, those forms will necessarily be correlated.
b) Use an F-test to check the value of adding ‘ASSTsq’ and ‘AssetsL.’ Don’t repeat any tests that have
already been done. (3)
Solution: Your raw materials here are the ANOVA

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       6  150.507  25.084  3.80  0.043
Residual Error   8   52.823   6.603
Total           14  203.329

and the sequential sums of squares relating to the mentioned variables.

Source   DF  Seq SS
ASSTsq    1  34.263
AssetsL   1  22.781
Sum       2  57.044

If we use the sum I just calculated to break up the ANOVA, we get the following.

Source             DF       SS      MS     F    F.05
4 indep variables   4   93.463  23.365  3.54   F.05(4,8) = 3.84
2 new variables     2   57.044  28.522  4.32   F.05(2,8) = 4.46
Residual Error      8   52.823   6.603
Total              14  203.329

I'm surprised. Though the improvement is evident at the 10% level, it is not at the 5% level that I used here.
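The same partial F-test can be run numerically; scipy supplies the critical value and an exact p-value, so no table lookup is needed. The sums of squares below come straight from the printout:

```python
from scipy import stats

sse_full = 52.823              # residual SS with all 6 predictors (Regression 7)
df_error = 8
mse_full = sse_full / df_error          # 6.603
seq_ss_new = 34.263 + 22.781            # sequential SS of ASSTsq + AssetsL

f_stat = (seq_ss_new / 2) / mse_full    # about 4.32
crit_05 = stats.f.ppf(0.95, 2, df_error)   # about 4.46
p_value = stats.f.sf(f_stat, 2, df_error)
# f_stat < crit_05 and .05 < p_value < .10: the two added variables are
# significant at the 10% level but not at the 5% level.
```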
Regression 8
MTB > Regress c2 7 c3 c5 c11 c14 c6 c7 c13;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.
Regression Analysis: 1YRR versus LOAD, 1YRT, ...
The regression equation is
1YRR = 161 + 47.8 LOAD - 2.77 1YRT + 0.0288 1YRTsq - 1.41 1YRTL - 0.809 ASSETS
+ 0.00167 ASSTsq + 0.166 AssetsL
Predictor       Coef    SE Coef      T      P     VIF
Constant      161.11      46.37   3.47  0.010
LOAD           47.81      48.13   0.99  0.354  1947.8
1YRT          -2.768      1.094  -2.53  0.039   578.5
1YRTsq       0.02882    0.01142   2.52  0.040   907.7
1YRTL        -1.4055     0.6353  -2.21  0.063  1567.3
ASSETS       -0.8091     0.2897  -2.79  0.027   606.7
ASSTsq     0.0016686  0.0005201   3.21  0.015   482.9
AssetsL      0.16616    0.09185   1.81  0.113   321.2

S = 2.10735   R-Sq = 84.7%   R-Sq(adj) = 69.4%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       7  172.243  24.606  5.54  0.019
Residual Error   7   31.087   4.441
Total           14  203.329

Source   DF  Seq SS
LOAD      1  26.006
1YRT      1  14.072
1YRTsq    1  21.482
1YRTL     1   2.573
ASSETS    1  41.804
ASSTsq    1  51.772
AssetsL   1  14.535
MTB > Name c16 "RESI1"
Regression 9
MTB > Regress c2 6 c5 c11 c14 c6 c7 c13;
SUBC>   Residuals 'RESI1';
SUBC>   GHistogram;
SUBC>   GNormalplot;
SUBC>   NoDGraphs;
SUBC>   RType 1;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.
Regression Analysis: 1YRR versus 1YRT, 1YRTsq, ...
The regression equation is
1YRR = 163 - 1.88 1YRT + 0.0193 1YRTsq - 0.842 1YRTL - 0.947 ASSETS
       + 0.00189 ASSTsq + 0.222 AssetsL

Predictor       Coef    SE Coef      T      P    VIF
Constant      162.54      46.31   3.51  0.008
1YRT         -1.8806     0.6305  -2.98  0.018  192.4
1YRTsq      0.019313   0.006217   3.11  0.015  269.5
1YRTL        -0.8416     0.2848  -2.96  0.018  315.5
ASSETS       -0.9475     0.2537  -3.73  0.006  466.3
ASSTsq     0.0018892  0.0004699   4.02  0.004  394.9
AssetsL      0.22154    0.07293   3.04  0.016  202.8

S = 2.10557   R-Sq = 82.6%   R-Sq(adj) = 69.5%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       6  167.862  27.977  6.31  0.010
Residual Error   8   35.467   4.433
Total           14  203.329

Source   DF  Seq SS
1YRT      1  38.826
1YRTsq    1  22.179
1YRTL     1   0.663
ASSETS    1  31.076
ASSTsq    1  34.203
AssetsL   1  40.915
9) a) So why am I very happy with Regression 9? (1.5)
Solution: What happened between Regressions 7 and 9 was that I removed 'LOAD' and added the interaction variable '1YRTL.' While 'LOAD' never had a significant coefficient, all the coefficients now are significant.
a) Why was I willing to accept a lower value of R-squared than in regression 8? (1)
Solution: R-squared adjusted actually rose.
b) Let's check the effect of the signs of the coefficients. To start with, we have what are essentially two equations here because of the interaction variables. What are they? (2)
The regression equation is
1YRR = 162.54 - 1.8806 1YRT + 0.019313 1YRTsq - 0.8416 1YRTL - 0.9475 ASSETS
+ 0.0018892 ASSTsq + 0.22154 AssetsL
For a firm with no load, '1YRTL' and 'AssetsL' are zero, so the equation reads as below.
The regression equation is
1YRR = 162.54 - 1.8806 1YRT + 0.019313 1YRTsq - 0.9475 ASSETS + 0.0018892 ASSTsq
For a firm with a load, we can add the coefficient of ‘1YRTL’ to the coefficient of ‘1YRT’ and add the
coefficient of ‘AssetsL’ to the coefficient of ‘ASSETS.’
The regression equation is
1YRR = 162.54 – 2.7222 1YRT + 0.01931 1YRTsq - 0.7257 ASSETS + 0.00189 ASSTsq
c) The mean value of 1YRT is 58.2 and the mean value of ASSETS is 247.01. If these two values apply to a
given firm, what is the predicted value of 1YRR for (i) a no-load firm and (ii) a load firm? (2)
We have ‘1YRT’ = 58.2, ‘1YRTsq’ = 3387.24, ‘ASSETS’ = 247.01 and ‘Asstsq’ = 61013.94
For a no-load firm
1YRR = 162.54 - 1.8806 1YRT + 0.019313 1YRTsq - 0.9475 ASSETS + 0.0018892 ASSTsq
     = 162.54 - 1.8806(58.2) + 0.019313(3387.24) - 0.9475(247.01) + 0.0018892(61013.94)
     = 162.54 - 109.45 + 65.42 - 234.04 + 115.26 = -0.27
Minitab gets -0.27.
For a firm with a load
1YRR = 162.54 - 2.7222 1YRT + 0.01931 1YRTsq - 0.7257 ASSETS + 0.00189 ASSTsq
     = 162.54 - 2.7222(58.2) + 0.01931(3387.24) - 0.7257(247.01) + 0.00189(61013.94)
     = 162.54 - 158.43 + 65.42 - 179.32 + 115.26 = 5.47
Minitab gets 5.54.
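These predictions can be reproduced with a few lines; the coefficients are the ones from Regression 9, and the interaction terms switch on only for a load fund:

```python
# Regression 9 coefficients: intercept, 1YRT, 1YRTsq, 1YRTL, ASSETS, ASSTsq, AssetsL
b0, b_t, b_t2, b_tl, b_a, b_a2, b_al = (
    162.54, -1.8806, 0.019313, -0.8416, -0.9475, 0.0018892, 0.22154)

t, a = 58.2, 247.01          # mean 1YRT and mean ASSETS

def predict(load):
    """Fitted 1YRR; the interaction terms contribute only when load = 1."""
    return (b0 + b_t * t + b_t2 * t**2 + b_a * a + b_a2 * a**2
            + load * (b_tl * t + b_al * a))

no_load = predict(0)    # about -0.27
with_load = predict(1)  # about 5.47 (Minitab's 5.54 differs only by rounding)
```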
d) So how do rises of 10 in 1-year turnover and 10 in Assets affect these firms? You need four answers
here. (4)
After a rise of 10 in '1YRT' we have '1YRT' = 68.2, '1YRTsq' = 4651.24, 'ASSETS' = 247.01 and 'ASSTsq' = 61013.94.
For a no-load firm
1YRR = 162.54 - 1.8806(68.2) + 0.019313(4651.24) - 0.9475(247.01) + 0.0018892(61013.94) = 5.34, from -0.27.
For a firm with a load
1YRR = 162.54 - 2.7222(68.2) + 0.01931(4651.24) - 0.7257(247.01) + 0.00189(61013.94) = 2.73, from 5.54.
After a rise of 10 in 'ASSETS,' we have '1YRT' = 58.2, '1YRTsq' = 3387.24, 'ASSETS' = 257.01 and 'ASSTsq' = 66054.14.
For a no-load firm
1YRR = 162.54 - 1.8806(58.2) + 0.019313(3387.24) - 0.9475(257.01) + 0.0018892(66054.14) = -0.22, from -0.27.
For a firm with a load
1YRR = 162.54 - 2.7222(58.2) + 0.01931(3387.24) - 0.7257(257.01) + 0.00189(66054.14) = 7.80, from 5.54.
e) So, in view of d) are there any coefficients here that don’t seem reasonable? Why? (1.5) [25.5]
Solution: The results for no-load firms don’t seem to make much sense. Rises in assets hurt the rate of
return and rises in turnover help it.
Normplot of Residuals for 1YRR
Residual Histogram for 1YRR
MTB > NormTest c16;
SUBC>   KSTest.
Probability Plot of RESI1
MTB > NormTest c16.
Probability Plot of RESI1
MTB > NormTest c16;
SUBC>   RJTest.
Probability Plot of RESI1
Wording in the little boxes on the three plots: Mean = 6.14175E-14, StDev = 1.592, N = 15; KS = .186 with p-value > 0.150; AD = 0.463; RJ with p-value > 0.100; the remaining values shown are .506 and .119.
Regression 10
MTB > BReg c2 c5 c11 c14 c6 c7 c13;
SUBC>   NVars 1 6;
SUBC>   Best 2;
SUBC>   Constant.

Best Subsets Regression: 1YRR versus 1YRT, 1YRTsq, ...
Response is 1YRR

Vars  R-Sq  R-Sq(adj)  Mallows C-p       S
   1  41.7       37.2         15.7  3.0203
   1  35.7       30.8         18.5  3.1705
   2  58.2       51.3         10.2  2.6599
   2  47.0       38.1         15.3  2.9980
   3  60.8       50.1         11.0  2.6918
   3  60.7       50.0         11.0  2.6938
   4  61.9       46.7         12.5  2.7826
   4  61.4       46.0         12.7  2.7998
   5  63.5       43.2         13.7  2.8710
   5  63.2       42.7         13.9  2.8851
   6  82.6       69.5          7.0  2.1056

(The X columns marking which of '1YRT,' '1YRTsq,' '1YRTL,' 'ASSETS,' 'ASSTsq' and 'AssetsL' enter each subset did not reproduce; the six-variable model in the last line uses all six.)

MTB > Save "C:\Documents and Settings\RBOVE\My Documents\Minitab\252x07041-01A.MTW";
SUBC>   Replace.
Saving file as: 'C:\Documents and Settings\RBOVE\My Documents\Minitab\252x07041-01A.MTW'
Existing file replaced.
10) So now I am even happier. The plots are for the Lilliefors test and two other similar tests. What comforting news are they telling me? Regression 10 didn't tell me much that I didn't know already, except for one thing. But how is it telling me that I have finished my job? (1.5) [27]
Solution: The only regression that has a satisfactory C-p is the one we just did. C-p is supposed to be about 1 plus the number of independent variables, and that is exactly what happened here (C-p = 7.0 with 6 independent variables). According to the plots, at least for this model, I seem to have Normal residuals, as assumed.
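The normality checks have scipy counterparts (statsmodels carries a dedicated Lilliefors routine); Shapiro-Wilk below plays the same correlation-type role as Ryan-Joiner, and Anderson-Darling is available directly. A sketch on made-up stand-in residuals, since the RESI1 values are not reproduced here:

```python
from scipy import stats

# Hypothetical stand-in residuals, roughly symmetric about zero.
resid = [-2.1, -1.6, -1.2, -0.9, -0.6, -0.3, -0.1, 0.0,
         0.2, 0.4, 0.7, 0.9, 1.3, 1.7, 2.0]

w_stat, p_sw = stats.shapiro(resid)      # Shapiro-Wilk, akin to Ryan-Joiner
ad = stats.anderson(resid, dist='norm')  # AD statistic plus critical values
# A large Shapiro-Wilk p-value, or an AD statistic below its 5% critical
# value, means normality of the residuals is not rejected.
```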
Note: The general technique employed here was to look at all possible variables first. The high VIFs killed that approach. I then restricted myself to the three basic variables that I had started with. After assuring myself that there was no collinearity present, I started worrying about significance and R-squared. I probably should have given the computer the option of putting back some of the excluded variables when I used the best subsets approach at the end.
II. Hand in your fourth computer problem. (2 to 8 points)
III. Do at least 4 of the following 6 Problems (at least 12 each) (or do sections adding to at least 50 points –
(Anything extra you do helps, and grades wrap around). It is especially important to do more if you have
skipped much of parts I or II. You must do part b) of problem 1. Show your work! State H0 and H1 where applicable. Use a significance level of 5% unless noted otherwise. Do not answer questions
without citing appropriate statistical tests – That is, explain your hypotheses and what values from
what table were used to test them. Clearly label what section of each problem you are doing! The
entire test has about 160 points, but 70 is considered a perfect score. Don’t waste our time by telling
me that two means, proportions, variances or medians don’t look the same to you. You need
statistical tests! There are some blank pages below. Put your name on as many loose pages as possible!
1) The data below represents four independent random samples. Each column represents hours between
breakdowns for the machines named. We assume that the underlying distribution is Normal. Please do the
following. Mark sections of your answer clearly.
a) If I want to test to see if the mean of x1 is larger than the mean of x2, my null hypothesis is: (Note: D = μ1 − μ2.) Only check one answer! (2)
[Choices i) through viii) each paired a relation between μ1 and μ2 (≥, ≤, >, <) with the corresponding statement about D; the relational symbols did not reproduce.]
Solution: Since this question will appear on future exams, it is left to the student.
Machine 1  Machine 2  Machine 3  Machine 4
      6.5        8.7       11.1        9.9
      7.9        7.4       10.3       12.8
      5.4        9.4        9.7       12.1
      7.5       10.1       10.3       10.8
      8.5        9.2        9.2       11.3
      7.3        9.8        8.8       11.5
Note the following: n1 = n2 = n3 = n4 = 6.
For machine 1: Σx1 = 43.1, Σx1² = 315.61, s1² = 1.201667
For machine 2: Σx2 = 54.6
For machine 3: Σx3 = 59.4, Σx3² = 591.56, s3² = 0.700000
For machine 4: Σx4 = 68.4, Σx4² = 784.84, s4² = 1.016000
b) Find the sample variance for machine 2. Note that you may need Σx2² later in the problem. (2)
c) Test the hypothesis that the mean for machine 2 is 10. Find an approximate p-value. (3)
d) Assume that the variances are the same for machine 2 and machine 3 and test to see if the mean
of machine 3 is larger than the mean of machine 2. Use a confidence interval, a test ratio and a
critical value for the mean (6). Or use just one of these three methods (3) [13]
e) Just to be on the safe side, test to see if the variances of the two machines were similar (3)
f) Assume equality of variances for all of the machines and test the hypothesis that all of the
machines have equal means. (6)
g) Explain how to test these columns to see if they have equal variances (1) [23]
Solution: b) Find the sample variance for machine 2. Note that you may need Σx2² later in the problem. (2)

   x2     x2²
  8.7   75.69
  7.4   54.76
  9.4   88.36
 10.1  102.01
  9.2   84.64
  9.8   96.04
 54.6  501.50
n2 = 6, Σx2 = 54.6, Σx2² = 501.50. So x̄2 = Σx2/n2 = 54.6/6 = 9.10 and
s2² = (Σx2² − n x̄2²)/(n − 1) = (501.50 − 6(9.10)²)/5 = 0.9280. Some of you still couldn't do this!
c) Test the hypothesis that the mean for machine 2 is 10. Find an approximate p-value. (3)
Table 3 has the material below. But if we want a p-value we should use a test ratio.

Interval for: Mean (σ unknown), DF = n − 1, sx̄ = s/√n
  Confidence interval: μ = x̄ ± t(α/2) sx̄
  Hypotheses: H0: μ = μ0, H1: μ ≠ μ0
  Test ratio: t = (x̄ − μ0)/sx̄
  Critical value: x̄cv = μ0 ± t(α/2) sx̄

Interval for: Difference between two means, D = μ1 − μ2 (σ unknown, variances assumed equal), DF = n1 + n2 − 2
  Confidence interval: D = d ± t(α/2) sd
  Hypotheses: H0: D = D0, H1: D ≠ D0
  Test ratio: t = (d − D0)/sd
  Critical value: dcv = D0 ± t(α/2) sd
  where sd = ŝp √(1/n1 + 1/n2) and ŝp² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2)

H0: μ = 10, H1: μ ≠ 10. sx̄² = s2²/n2 = 0.9280/6 = 0.1546667, so sx̄ = 0.39328.
t = (9.10 − 10)/0.39328 = −2.288. We have 5 degrees of freedom and the relevant part of the t table is below.

df   .45   .40   .35   .30   .25   .20   .15   .10   .05  .025   .01  .005  .001
 1 0.158 0.325 0.510 0.727 1.000 1.376 1.963 3.078 6.314 12.71 31.82 63.66 318.3
 2 0.142 0.289 0.445 0.617 0.816 1.061 1.386 1.886 2.920 4.303 6.965 9.925 22.33
 3 0.137 0.277 0.424 0.584 0.765 0.978 1.250 1.638 2.353 3.182 4.541 5.841 10.21
 4 0.134 0.271 0.414 0.569 0.741 0.941 1.190 1.533 2.132 2.776 3.747 4.604 7.173
 5 0.132 0.267 0.408 0.559 0.727 0.920 1.156 1.476 2.015 2.571 3.365 4.032 5.893
 6 0.131 0.265 0.404 0.553 0.718 0.906 1.134 1.440 1.943 2.447 3.143 3.707 5.208

Since 2.288 lies between t.05(5) = 2.015 and t.025(5) = 2.571, for a left-sided test we would say .025 < p-value < .05. But since this is a 2-sided test, we must double these numbers to .05 < p-value < .10. Thus we cannot reject the null hypothesis at the 95% confidence level.
d) Assume that the variances are the same for machine 2 and machine 3 and test to see if the mean
of machine 3 is larger than the mean of machine 2. Use a confidence interval, a test ratio and a
critical value for the mean (6). Or use just one of these three methods (3) [13]
Let D   2   3 . Our alternative hypothesis is  2   3 , since it does not contain an equality. If this is
H 0 :  2   3
H 0 : D  0
true, our hypotheses read 
or 
Wake up! This is a one-sided test. This means
H 1 : D  0
H 1 :  2   3
that confidence intervals and critical values must be one-sided.
and s32  0.700000. We thus have x 2 
x
2
 54.6 ,
x
3
 59.4 , s 22  0.9280
59 .4
54 .6
 9.90 , d  x 2  x3  0.8 and
 9.10 and x 3 
6
6
50.9280  50.7000
 1
2
1 
 0.8140 s d  sˆ 2p 
   0.8148  0.2716  0.5212
6
10
 n 2 n3 
Degrees of freedom are 6 + 6 – 2 = 10. I have copied part of the t-table below, and we can see that
10
t .05
 1.812 .
sˆ 2p 
15
252y0741 5/7/07
df .45 .40 .35 .30 .25 .20 .15 .10 .05 .025 .01 .005 .001
10 0.129 0.260 0.397 0.542 0.700 0.879 1.093 1.372 1.812 2.228 2.764 3.169 4.144
11 0.129 0.260 0.396 0.540 0.697 0.876 1.088 1.363 1.796 2.201 2.718 3.106 4.025
12 0.128 0.259 0.395 0.539 0.695 0.873 1.083 1.356 1.782 2.179 2.681 3.055 3.930
Confidence interval: If we check the alternate hypothesis H 1 : D  0 and form a 1-sided confidence interval
in the same direction we will make the given formula D  d  t  2 s d into D  d  t s d
 0.8  1.812 0.5212   0.8  0.944  0.144 . If D  0.144 , H 0 : D  0 can be true. So we do not reject
the null hypothesis.
Critical value for d : If we look at the alternate hypothesis, we are looking for a critical value for d that is
below zero. The given formula d cv  D0  t  2 s d becomes d cv  0  t s d  1.812 0.5212   0.944 . We
will reject the null hypothesis if d is below -0.944. Since d  0.8 is not below the critical value, we
cannot reject the null hypothesis. Make a diagram. Show an almost Normal curve centered at zero and a
‘reject region below -0.944. This is a left- sided test unless you defined D as D   3   2 .
d  D0
0.8  0

 1.535 . Make a diagram.
sd
0.5212
Show an almost Normal curve centered at zero and a ‘reject region below  t 10  1.812 . Since our
Test Ratio: The format for the test ratio was given as t 
.05
computed value of t is not below -1.812, do not reject the null hypothesis.
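As a check (not part of the original exam), the pooled-variance test statistic can be rebuilt from the summary statistics above with a short Python sketch; the variable names are my own:

```python
import math

# Pooled-variance t test sketch for H0: mu2 >= mu3 vs H1: mu2 < mu3,
# rebuilt from the quoted summary statistics (an illustration, not exam material).
n2, n3 = 6, 6
xbar2, xbar3 = 54.6 / 6, 59.4 / 6          # 9.10 and 9.90
s2_sq, s3_sq = 0.9280, 0.7000

# Pooled variance and the standard error of dbar = xbar2 - xbar3
sp_sq = ((n2 - 1) * s2_sq + (n3 - 1) * s3_sq) / (n2 + n3 - 2)   # 0.8140
s_dbar = math.sqrt(sp_sq * (1 / n2 + 1 / n3))                    # about 0.521

dbar = xbar2 - xbar3                                             # -0.8
t_stat = (dbar - 0) / s_dbar                                     # about -1.54

# One-sided test with 10 df at alpha = .05: reject only if t < -1.812
reject_h0 = t_stat < -1.812
```

The computed t of about −1.54 agrees with the hand calculation and again fails to reach the one-sided critical value.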
e) Just to be on the safe side, test to see if the variances of the two machines were similar (3).
H0: σ2² = σ3² against H1: σ2² ≠ σ3². We test the two variance ratios s2²/s3² = 0.9280/0.7000 = 1.326 and s3²/s2² = 0.7000/0.9280 < 1 against F.025(10,10) = 3.72. Since the computed ratio is below the table value, do not reject the null hypothesis of equal variances.
The F table (df2 = 10):

df1      1      2      3      4      5      6      7      8      9     10     11     12
0.100  3.29   2.92   2.73   2.61   2.52   2.46   2.41   2.38   2.35   2.32   2.30   2.28
0.050  4.96   4.10   3.71   3.48   3.33   3.22   3.14   3.07   3.02   2.98   2.94   2.91
0.025  6.94   5.46   4.83   4.47   4.24   4.07   3.95   3.85   3.78   3.72   3.66   3.62
0.010 10.04   7.56   6.55   5.99   5.64   5.39   5.20   5.06   4.94   4.85   4.77   4.71
f) Assume equality of variances for all of the machines and test the hypothesis that all of the
machines have equal means. (6)
A multiple test of means from random samples is 1-way ANOVA. H0: μ1 = μ2 = μ3 = μ4.
        x.1      x.2      x.3      x.4
        6.5      8.7     11.1      9.9
        7.9      7.4     10.3     12.8
        5.4      9.4      9.7     12.1
        7.5     10.1     10.3     10.8
        8.5      9.2      9.2     11.3
        7.3      9.8      8.8     11.5
Sum    43.1     54.6     59.4     68.4
n_j       6        6        6        6
x̄.j  7.1833   9.1000   9.9000   11.400
SS   315.61   501.50   591.56   784.84    Σx² = 2193.51
x̄.j² 51.5998  82.8100  98.0100  129.960   Σx̄.j² = 362.3898
SOURCE           SS      DF    MS      F
machine          55.534   3  18.511  19.25
error (within)   19.231  20   0.9616
total            74.765  23

Since F.05(3,20) = 3.10, reject H0. SST and SSB are computed below, using Σx = 225.5, n = 24 and x̄ = 9.3958.
SST = Σx² − n x̄² = 2193.51 − 24(9.3958)² = 74.7646
SSB = Σ n_j x̄.j² − n x̄² = 6(362.380) − 24(9.3958)² = 2174.279 − 2118.745 = 55.534
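The ANOVA arithmetic can be verified with a short pure-Python sketch (my own illustration, not part of the exam):

```python
# One-way ANOVA sketch for the four machines (illustration only).
machines = [
    [6.5, 7.9, 5.4, 7.5, 8.5, 7.3],       # Machine 1
    [8.7, 7.4, 9.4, 10.1, 9.2, 9.8],      # Machine 2
    [11.1, 10.3, 9.7, 10.3, 9.2, 8.8],    # Machine 3
    [9.9, 12.8, 12.1, 10.8, 11.3, 11.5],  # Machine 4
]
all_x = [x for col in machines for x in col]
n = len(all_x)                            # 24
grand_mean = sum(all_x) / n               # about 9.3958

# Between-group, total and within-group sums of squares
ssb = sum(len(c) * (sum(c) / len(c) - grand_mean) ** 2 for c in machines)
sst = sum((x - grand_mean) ** 2 for x in all_x)
ssw = sst - ssb

df_b, df_w = len(machines) - 1, n - len(machines)     # 3 and 20
f_stat = (ssb / df_b) / (ssw / df_w)                  # about 19.25
```

The F of about 19.25 matches the table and is far above F.05(3,20) = 3.10.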
g) Explain how to test these columns to see if they have equal variances (1) [23]
Since we are assuming that the parent distributions are Normal, the most appropriate test is a Bartlett test.
2) Let us go back to the data in Problem 1. The data is now presented with their ranks. Assume that the
underlying distribution was different from the Normal.
Row  Mach1  RankM1  Mach2  RankM2  Mach3  RankM3  Mach4  RankM4
1      6.5       2    8.7     8.0   11.1    20.0    9.9      15
2      7.9       6    7.4     4.0   10.3    17.5   12.8      24
3      5.4       1    9.4    12.0    9.7    13.0   12.1      23
4      7.5       5   10.1    16.0   10.3    17.5   10.8      19
5      8.5       7    9.2    10.5    9.2    10.5   11.3      21
6      7.3       3    9.8    14.0    8.8     9.0   11.5      22
a) Assume that we believe that Machine 1 and Machine 4 have the same median. Treat them as a
single sample and test that the median is 8. (2 or 3 depending on method)
It may simplify your work in a)-c) if I present the data as a single sample. The original numbers are in 'Mc1,4stacked.' 'McNo1,4' just identifies the machine that they came from. 'RankMc1,4' ranks the numbers from 1 through 12. 'Mc1,4-8' is simply 'Mc1,4stacked' with 8 subtracted from it.
Row  Mc1,4stacked  McNo1,4  RankMc1,4  Mc1,4-8
1           6.5      Mc1         2       -1.5
2           7.9      Mc1         5       -0.1
3           5.4      Mc1         1       -2.6
4           7.5      Mc1         4       -0.5
5           8.5      Mc1         6        0.5
6           7.3      Mc1         3       -0.7
7           9.9      Mc4         7        1.9
8          12.8      Mc4        12        4.8
9          12.1      Mc4        11        4.1
10         10.8      Mc4         8        2.8
11         11.3      Mc4         9        3.3
12         11.5      Mc4        10        3.5
The most appropriate method is the Wilcoxon signed rank test, though the sign test can be done with the same data. We are testing H0: η = 8, where η is the median.
Row     x  Machine   r   x−8   |x−8|  rank  signed rank
1     6.5     Mc1    2  -1.5    1.5    5.0       -5.0
2     7.9     Mc1    5  -0.1    0.1    1.0       -1.0
3     5.4     Mc1    1  -2.6    2.6    7.0       -7.0
4     7.5     Mc1    4  -0.5    0.5    2.5       -2.5
5     8.5     Mc1    6   0.5    0.5    2.5        2.5
6     7.3     Mc1    3  -0.7    0.7    4.0       -4.0
7     9.9     Mc4    7   1.9    1.9    6.0        6.0
8    12.8     Mc4   12   4.8    4.8   12.0       12.0
9    12.1     Mc4   11   4.1    4.1   11.0       11.0
10   10.8     Mc4    8   2.8    2.8    8.0        8.0
11   11.3     Mc4    9   3.3    3.3    9.0        9.0
12   11.5     Mc4   10   3.5    3.5   10.0       10.0

We have T⁻ = 19.5 and T⁺ = 58.5. To check for accuracy of ranking note that 19.5 + 58.5 = 78 = 12(13)/2. We check Table 7 and find that for a 2-sided 95% test with n = 12, we use 14 as a critical value. Because 19.5 is not below the critical value, we cannot reject the null hypothesis. To do a sign test, note that there are 5 numbers below 8 and 7 above 8.
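The signed-rank sums can be reproduced with a short Python sketch (my own illustration, not part of the exam):

```python
# Wilcoxon signed-rank sketch for H0: median = 8 on the stacked
# Machine 1 and Machine 4 data (illustration only).
data = [6.5, 7.9, 5.4, 7.5, 8.5, 7.3, 9.9, 12.8, 12.1, 10.8, 11.3, 11.5]
diffs = [x - 8 for x in data]

# Rank the absolute differences; tied values share their average rank
abs_sorted = sorted(abs(d) for d in diffs)
def avg_rank(v):
    positions = [i + 1 for i, a in enumerate(abs_sorted) if a == v]
    return sum(positions) / len(positions)

t_plus = sum(avg_rank(abs(d)) for d in diffs if d > 0)    # 58.5
t_minus = sum(avg_rank(abs(d)) for d in diffs if d < 0)   # 19.5

# Ranking check: the two sums must add to n(n+1)/2 = 78
assert t_plus + t_minus == 12 * 13 / 2
```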
b) Remember that these are actually two independent samples. Test for the equality of the medians of Machines 1 and 4. (3)
If we look at the column marked r, we can sum the ranks for each machine. We get SR1 = 21 and SR4 = 57. To check for accuracy of ranking note that 21 + 57 = 78 = 12(13)/2. Table 6 says that for the null hypothesis of equality to be true the smaller of these two must be above 26. It isn't, so we reject the null hypothesis of equal medians. It says 'independent samples' here, so why did some people use a Wilcoxon signed rank test?
c) There is another test for the median available. The Wald-Wolfowitz test is a version of the runs
test. Put the numbers in order (Ranking has been done for you.) Now mark each number with a +
or a – depending on whether the number came from Machine 1 or Machine 4. Do a runs test. If the
null hypothesis of randomness is rejected, you can say that the medians are not the same. What do
you find? (3) The original data follows.
Row  Mc1,4stacked  McNo1,4  RankMc1,4  Mc1,4-8
1           6.5      Mc1         2       -1.5
2           7.9      Mc1         5       -0.1
3           5.4      Mc1         1       -2.6
4           7.5      Mc1         4       -0.5
5           8.5      Mc1         6        0.5
6           7.3      Mc1         3       -0.7
7           9.9      Mc4         7        1.9
8          12.8      Mc4        12        4.8
9          12.1      Mc4        11        4.1
10         10.8      Mc4         8        2.8
11         11.3      Mc4         9        3.3
12         11.5      Mc4        10        3.5
If we put the data in order as suggested we get the table below. Of course, since we have 'RankMc1,4,' we do not need the 'Mc1,4stacked' column.

Row  Mc1,4stacked  McNo1,4  RankMc1,4  Sign
1           5.4      Mc1         1       +
2           6.5      Mc1         2       +
3           7.3      Mc1         3       +
4           7.5      Mc1         4       +
5           7.9      Mc1         5       +
6           8.5      Mc1         6       +
7           9.9      Mc4         7       -
8          10.8      Mc4         8       -
9          11.3      Mc4         9       -
10         11.5      Mc4        10       -
11         12.1      Mc4        11       -
12         12.8      Mc4        12       -

The sequence is + + + + + + - - - - - -. So n1 = 6, n2 = 6 and the number of runs is r = 2. The runs test table says to reject randomness if the number of runs is 3 or lower or 11 or higher. So we reject the null hypothesis of equal medians.
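The run count can be confirmed with a tiny Python sketch (my own illustration, not part of the exam):

```python
# Runs-count sketch for the Wald-Wolfowitz test (illustration only):
# '+' marks a value from Machine 1, '-' one from Machine 4, in sorted order.
signs = ['+'] * 6 + ['-'] * 6
runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)   # 2

# With n1 = n2 = 6, reject randomness if runs <= 3 or runs >= 11
reject_randomness = runs <= 3 or runs >= 11
```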
d) Do a test of the four independent samples to see if their medians differ (4) [12, 35]
       x1  r1     x2    r2     x3    r3     x4  r4
1     6.5   2    8.7   8.0   11.1  20.0    9.9  15
2     7.9   6    7.4   4.0   10.3  17.5   12.8  24
3     5.4   1    9.4  12.0    9.7  13.0   12.1  23
4     7.5   5   10.1  16.0   10.3  17.5   10.8  19
5     8.5   7    9.2  10.5    9.2  10.5   11.3  21
6     7.3   3    9.8  14.0    8.8   9.0   11.5  22
SR         24         64.5         87.5         124
Solution: If the columns are independent random samples and the distribution is not Normal, we have a Kruskal-Wallis test. The null hypothesis is that we have equal medians. The sums of ranks should add to 24(25)/2 = 300. The number of items in all the columns is n = 24, the number of items in each column is n_i = 6 and we have 4 columns, which is too many items in a column to use the K-W table, so we say that we will consider the K-W statistic to have a Chi-squared distribution with 4 − 1 = 3 degrees of freedom.
We must compute the Kruskal-Wallis statistic
H = [12/(n(n+1))] Σ(SR_i²/n_i) − 3(n+1) = [12/(24(25))][24²/6 + 64.5²/6 + 87.5²/6 + 124²/6] − 3(25)
  = (12/600)(27768.5/6) − 75 = 92.5617 − 75 = 17.5617.
Since this is larger than χ².05(3) = 7.8147, we reject our null hypothesis.
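The Kruskal-Wallis arithmetic can be checked with a short Python sketch (my own illustration, not part of the exam):

```python
# Kruskal-Wallis sketch (illustration only); ties get average ranks.
groups = [
    [6.5, 7.9, 5.4, 7.5, 8.5, 7.3],
    [8.7, 7.4, 9.4, 10.1, 9.2, 9.8],
    [11.1, 10.3, 9.7, 10.3, 9.2, 8.8],
    [9.9, 12.8, 12.1, 10.8, 11.3, 11.5],
]
pooled = sorted(x for g in groups for x in g)
n = len(pooled)                                   # 24

def avg_rank(v):
    positions = [i + 1 for i, p in enumerate(pooled) if p == v]
    return sum(positions) / len(positions)

rank_sums = [sum(avg_rank(x) for x in g) for g in groups]   # 24, 64.5, 87.5, 124
h_stat = (12 / (n * (n + 1))) * sum(
    sr ** 2 / len(g) for sr, g in zip(rank_sums, groups)) - 3 * (n + 1)
```

The sketch reproduces the rank sums of 24, 64.5, 87.5 and 124 and H of about 17.56.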
3) Curses! It seems that these were not random samples at all. Each row represents a set of measurements
that were taken on one of 6 randomly chosen days. Thus the data is blocked by days.
Day  Mach1  Mach2  Mach3  Mach4  RowSum  RowSum of Squares
1      6.5    8.7   11.1    9.9    36.2    339.16
2      7.9    7.4   10.3   12.8    38.4    387.10
3      5.4    9.4    9.7   12.1    36.6    358.02
4      7.5   10.1   10.3   10.8    38.7    380.99
5      8.5    9.2    9.2   11.3    38.2    369.22
6      7.3    9.8    8.8   11.5    37.4    359.02
a) Using the fact that the data is cross classified and assuming that the underlying distribution is
Normal, test that the means of Machines 1 and 4 are equal. (4) Note that the credit on this problem
is large because none of the computations above or in previous problems will help you here.
The table below is the setup for both a) and c). The differences and their squares that are needed for a) are calculated. For the Wilcoxon signed rank test in c) the absolute values of the differences are calculated in |d|. These are ranked in 'rank' and have their signs restored in 'signed rank.'
Row    x1     x4   d = x1 − x4      d²    |d|  rank  signed rank
1     6.5    9.9       -3.4      11.56    3.4    3       -3
2     7.9   12.8       -4.9      24.01    4.9    5       -5
3     5.4   12.1       -6.7      44.89    6.7    6       -6
4     7.5   10.8       -3.3      10.89    3.3    2       -2
5     8.5   11.3       -2.8       7.84    2.8    1       -1
6     7.3   11.5       -4.2      17.64    4.2    4       -4
Sum                   -25.3     116.83                  -21

From the formula table, for the difference between two means (paired data): Confidence Interval D = d̄ ± t(α/2) s_d̄; Hypotheses H0: D = D0, H1: D ≠ D0, where D = μ1 − μ2; Test Ratio t = (d̄ − D0)/s_d̄; Critical Value d̄cv = D0 ± t(α/2) s_d̄. Here s_d̄ = s_d/√n, d = x1 − x2 and df = n − 1 where n1 = n2 = n.
Our hypotheses read H0: μ1 = μ4, H1: μ1 ≠ μ4 or H0: D = 0, H1: D ≠ 0.
n = 6, Σd = −25.3 and Σd² = 116.83. So d̄ = Σd/n = −25.3/6 = −4.2167 and
s_d² = [Σd² − n d̄²]/(n − 1) = [116.83 − 6(4.2167)²]/5 = 2.0297.  df = n − 1 = 5 and t.025(5) = 2.571.
Use only one of the following methods.
Confidence interval: D = d̄ ± t(α/2) s_d̄ is now D = −4.2167 ± 2.571 √(2.0297/6) = −4.2167 ± 2.571(0.5817) = −4.2167 ± 1.4956. Since the error term is smaller in absolute value than the mean difference, the interval does not include zero and we must reject the null hypothesis.
Critical value for d̄: The given formula d̄cv = D0 ± t(α/2) s_d̄ becomes d̄cv = 0 ± 2.571(0.5817) = ±1.4956. Make a diagram. Show an almost Normal curve centered at zero and 'reject' regions below −1.4956 and above 1.4956. Since d̄ = −4.2167 falls in the lower reject region, we reject the null hypothesis.
Test Ratio: The format for the test ratio was given as t = (d̄ − D0)/s_d̄ = (−4.2167 − 0)/0.5817 = −7.249. Make a diagram. Show an almost Normal curve centered at zero and 'reject' regions below −t.025(5) = −2.571 and above t.025(5) = 2.571. Since our computed value of t is below −2.571, reject the null hypothesis.
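The paired-difference computation can be checked with a short Python sketch (my own illustration, not part of the exam):

```python
import math

# Paired-difference t test sketch for Machines 1 and 4 (illustration only).
d = [-3.4, -4.9, -6.7, -3.3, -2.8, -4.2]
n = len(d)
dbar = sum(d) / n                                             # about -4.2167
s_d_sq = (sum(x ** 2 for x in d) - n * dbar ** 2) / (n - 1)   # about 2.030
s_dbar = math.sqrt(s_d_sq / n)                                # about 0.5817

t_stat = (dbar - 0) / s_dbar                                  # about -7.25
# Two-sided test with n - 1 = 5 df: reject because |t| > 2.571
reject_h0 = abs(t_stat) > 2.571
```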
b) Using the fact that the data is cross classified and assuming that the underlying distribution is
Normal, test that the means of all four machines are equal. Note that you already should have done
about half of this problem in 1f). (4)
If we copy the data at the beginning of the problem we have

Day  Mach1  Mach2  Mach3  Mach4  RowSum  n_i   x̄_i.   Row SS    x̄_i.²
1      6.5    8.7   11.1    9.9    36.2    4   9.050   339.16   81.9025
2      7.9    7.4   10.3   12.8    38.4    4   9.600   387.10   92.1600
3      5.4    9.4    9.7   12.1    36.6    4   9.150   358.02   83.7225
4      7.5   10.1   10.3   10.8    38.7    4   9.675   380.99   93.6056
5      8.5    9.2    9.2   11.3    38.2    4   9.550   369.22   91.2025
6      7.3    9.8    8.8   11.5    37.4    4   9.350   359.02   87.4225
Sum                               225.5   24  (9.3958) 2193.51  530.0156

(x̄ = 9.3958; this is not a sum.) So Σx = 225.5, n = 24, Σx² = 2193.51 and Σx̄_i.² = 530.0156. The column sums are 43.1, 54.6, 59.4 and 68.4, each with n_j = 6; the column means are x̄.j = 7.1833, 9.1000, 9.9000 and 11.400; the column sums of squares are SS = 315.61, 501.50, 591.56 and 784.84; and Σx̄.j² = 362.3898.

We already know SST = Σx² − n x̄² = 2193.51 − 24(9.3958)² = 74.7646.
SSC = Σ n_j x̄.j² − n x̄² = 2174.279 − 24(9.3958)² = 55.534
SSR = Σ n_i x̄_i.² − n x̄² = 4(530.0156) − 24(9.3958)² = 2120.0624 − 2118.7454 = 1.3170
Source    SS       DF   MS       F          F.05          H0
Rows       1.317    5   0.2634   0.2206ns   F(5,15)=2.90  Row means equal
Columns   55.534    3  18.5113  15.5010s    F(3,15)=3.29  Column means equal
Within    17.9136  15   1.1942
Total     74.7646  23
Predictably, a Minitab run shows us to have generated a lot of rounding error, but the facts are the same.
Two-way ANOVA: McStacked versus Row, McNo

Source  DF       SS       MS      F      P
Row      5   1.3021   0.2604   0.22  0.949
McNo     3  55.5213  18.5071  15.49  0.000
Error   15  17.9263   1.1951
Total   23  74.7496

S = 1.093   R-Sq = 76.02%   R-Sq(adj) = 63.23%
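The two-way ANOVA sums of squares can also be reproduced in pure Python (my own illustration, not part of the exam):

```python
# Two-way ANOVA (one observation per cell) sketch: rows are days,
# columns are machines (illustration only).
table = [
    [6.5, 8.7, 11.1, 9.9],
    [7.9, 7.4, 10.3, 12.8],
    [5.4, 9.4, 9.7, 12.1],
    [7.5, 10.1, 10.3, 10.8],
    [8.5, 9.2, 9.2, 11.3],
    [7.3, 9.8, 8.8, 11.5],
]
r, c = len(table), len(table[0])                 # 6 rows, 4 columns
n = r * c
grand = sum(sum(row) for row in table) / n       # about 9.3958

row_means = [sum(row) / c for row in table]
col_means = [sum(row[j] for row in table) / r for j in range(c)]

sst = sum((x - grand) ** 2 for row in table for x in row)
ss_rows = c * sum((m - grand) ** 2 for m in row_means)   # about 1.30
ss_cols = r * sum((m - grand) ** 2 for m in col_means)   # about 55.52
sse = sst - ss_rows - ss_cols

mse = sse / ((r - 1) * (c - 1))                 # error MS on 15 df
f_rows = (ss_rows / (r - 1)) / mse              # about 0.22
f_cols = (ss_cols / (c - 1)) / mse              # about 15.5
```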
c) But let’s go back to the possibility that the underlying distribution is not Normal. Using the fact
that the data is cross classified and assuming that the underlying distribution is not Normal, test
that the medians of Machines 1 and 4 are equal. Remember that none of the methods that we
used to compare medians actually involves computing the median! (3)
If we go back to the table in a), we find that the two signed rank sums are T⁺ = 0 and T⁻ = 21. According to Table 7, for a 2-sided 5% test, we reject the null hypothesis of equal medians if the smaller of the two rank sums is ≤ 1. Thus we reject the null hypothesis.
d) Using the fact that the data is cross classified and assuming that the underlying distribution is
not Normal, test that the medians of all four machines are equal. (4)
Solution: This is a Friedman test.
H0: η1 = η2 = η3 = η4, where 1, 2, 3 and 4 index the machines.
H1: At least one of the medians differs.
First we rank the data within rows. The data appears below in columns marked x1 to x 4 and the
ranks are in columns marked r1 to r4 .
       x1  r1     x2   r2     x3   r3     x4  r4
1     6.5   1    8.7    2   11.1    4     9.9   3
2     7.9   2    7.4    1   10.3    3    12.8   4
3     5.4   1    9.4    2    9.7    3    12.1   4
4     7.5   1   10.1    2   10.3    3    10.8   4
5     8.5   1    9.2  2.5    9.2  2.5    11.3   4
6     7.3   1    9.8    3    8.8    2    11.5   4
SR          7         12.5        17.5          23
To check the ranking, note that the sum of the four rank sums is 7 + 12.5 + 17.5 + 23 = 60, and that the sum of the rank sums should be rc(c+1)/2 = 6(4)(5)/2 = 60. Now compute the Friedman statistic
χ²F = [12/(rc(c+1))] Σ SR_i² − 3r(c+1) = [12/(6(4)(5))][7² + 12.5² + 17.5² + 23²] − 3(6)(5)
    = (12/120)(1040.5) − 90 = 104.05 − 90 = 14.05.
This problem is too large for the Friedman table, so we say that we will consider the Friedman statistic to have a Chi-squared distribution with 4 − 1 = 3 degrees of freedom. Since this is larger than χ².05(3) = 7.8147, we reject our null hypothesis of equal column medians.
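The Friedman arithmetic can be checked with a short Python sketch (my own illustration, not part of the exam):

```python
# Friedman test sketch: rank within each row, ties get average ranks
# (illustration only).
rows = [
    [6.5, 8.7, 11.1, 9.9],
    [7.9, 7.4, 10.3, 12.8],
    [5.4, 9.4, 9.7, 12.1],
    [7.5, 10.1, 10.3, 10.8],
    [8.5, 9.2, 9.2, 11.3],
    [7.3, 9.8, 8.8, 11.5],
]
r, c = len(rows), len(rows[0])

def within_row_ranks(row):
    s = sorted(row)
    return [sum(i + 1 for i, v in enumerate(s) if v == x) / s.count(x) for x in row]

rank_sums = [sum(within_row_ranks(row)[j] for row in rows) for j in range(c)]
chi2_f = (12 / (r * c * (c + 1))) * sum(sr ** 2 for sr in rank_sums) - 3 * r * (c + 1)
```

The sketch reproduces the rank sums 7, 12.5, 17.5 and 23 and the statistic of about 14.05.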
e) To check whether c) or d) is correct do a Normality test on the entire data set. Most of this is
done for you. Column (1) is the original data. Column (2) is the original data standardized. Column
(3) is a cumulative Normal probability. Column (5) is column (4) divided by 24. Column (6) is the
difference between Column (3) and Column (5). Please explain carefully how you complete this
test, including what table you are using. (3) [18, 53]
Row   (1)     (2)        (3)      (4)   (5)       (6)
1     5.4  -2.21650  0.013329     1   0.04167  -0.0283379
2     6.5  -1.60632  0.054101     2   0.08333  -0.0292319
3     7.3  -1.16256  0.122504     3   0.12500  -0.0024965
4     7.4  -1.10709  0.134127     4   0.16667  -0.0325396
5     7.5  -1.05162  0.146486     5   0.20833  -0.0618468
6     7.9  -0.82974  0.203343     6   0.25000  -0.0466575
7     8.5  -0.49692  0.309623     7   0.29167   0.0179560
8     8.7  -0.38598  0.349756     8   0.33333   0.0164224
9     8.8  -0.33051  0.370507     9   0.37500  -0.0044926
10    9.2  -0.10863  0.456748    10   0.41667   0.0400817
11    9.2  -0.10863  0.456748    11   0.45833  -0.0015850
12    9.4   0.00231  0.500922    12   0.50000   0.0009221
13    9.7   0.16872  0.566992    13   0.54167   0.0253256
14    9.8   0.22419  0.588696    14   0.58333   0.0053627
15    9.9   0.27966  0.610132    15   0.62500  -0.0148684
16   10.1   0.39060  0.651954    16   0.66667  -0.0147122
17   10.3   0.50154  0.692005    17   0.70833  -0.0163279
18   10.3   0.50154  0.692005    18   0.75000  -0.0579946
19   10.8   0.77889  0.781979    19   0.79167  -0.0096878
20   11.1   0.94530  0.827748    20   0.83333  -0.0055851
21   11.3   1.05624  0.854572    21   0.87500  -0.0204282
22   11.5   1.16718  0.878432    22   0.91667  -0.0382345
23   12.1   1.50001  0.933194    23   0.95833  -0.0251398
24   12.8   1.88830  0.970507    24   1.00000  -0.0294930
This problem was 95% done for you, but no one did it. The best method to use here is Lilliefors because the data is not stated by intervals, the distribution for which we are testing is Normal, and the parameters of the distribution are unknown. We began by putting the data in order and computing z = (x − x̄)/s (actually t) and proceeded as in the Kolmogorov-Smirnov method. For example, in the second row z = (6.5 − x̄)/s = −1.60632. O = 1 because there is only one number in each interval, and n = ΣO = 24. To compute the observed relative distribution, we summed the O column into the CumO column and then divided by n = 24 so that each value of Fo shows the fraction of observations that were ≤ x. Fe is a fairly exact computation of the standardized Normal probability of being below x. It can be approximated using the Normal table. For example, for row 2, z = −1.60632 and the cumulative probability will be Fe = F(−1.60) = P(z ≤ −1.60) = P(z ≤ 0) − P(−1.60 ≤ z ≤ 0) = .5 − .4452 = .0548 ≈ .054101. The last column should be the absolute value of the difference between Fe and Fo, but the signs have not been removed. The largest absolute value seems to be 0.0618468. According to the Lilliefors table, the 5% critical value for n = 24 is between .190 and .180. Since none of the numbers in the D column are that large in absolute value, we cannot reject the null hypothesis that the distribution is Normal.
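The maximum deviation can be reproduced with a short Python sketch using the error function for the Normal CDF (my own illustration, not part of the exam):

```python
import math

# Lilliefors statistic sketch: max |Fe - Fo| with Fo = (cumulative count)/n,
# exactly as the table below computes it (illustration only).
data = sorted([6.5, 7.9, 5.4, 7.5, 8.5, 7.3, 8.7, 7.4, 9.4, 10.1, 9.2, 9.8,
               11.1, 10.3, 9.7, 10.3, 9.2, 8.8, 9.9, 12.8, 12.1, 10.8, 11.3, 11.5])
n = len(data)
mean = sum(data) / n
s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

def phi(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

d_max = max(abs(phi((x - mean) / s) - (i + 1) / n) for i, x in enumerate(data))
# d_max is about 0.062, well below the 5% Lilliefors critical value (~0.18)
```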
Row    x        z         Fe      O  CumO    Fo      D = Fe − Fo
1     5.4  -2.21650  0.013329    1    1   0.04167   -0.0283379
2     6.5  -1.60632  0.054101    1    2   0.08333   -0.0292319
3     7.3  -1.16256  0.122504    1    3   0.12500   -0.0024965
4     7.4  -1.10709  0.134127    1    4   0.16667   -0.0325396
5     7.5  -1.05162  0.146486    1    5   0.20833   -0.0618468
6     7.9  -0.82974  0.203343    1    6   0.25000   -0.0466575
7     8.5  -0.49692  0.309623    1    7   0.29167    0.0179560
8     8.7  -0.38598  0.349756    1    8   0.33333    0.0164224
9     8.8  -0.33051  0.370507    1    9   0.37500   -0.0044926
10    9.2  -0.10863  0.456748    1   10   0.41667    0.0400817
11    9.2  -0.10863  0.456748    1   11   0.45833   -0.0015850
12    9.4   0.00231  0.500922    1   12   0.50000    0.0009221
13    9.7   0.16872  0.566992    1   13   0.54167    0.0253256
14    9.8   0.22419  0.588696    1   14   0.58333    0.0053627
15    9.9   0.27966  0.610132    1   15   0.62500   -0.0148684
16   10.1   0.39060  0.651954    1   16   0.66667   -0.0147122
17   10.3   0.50154  0.692005    1   17   0.70833   -0.0163279
18   10.3   0.50154  0.692005    1   18   0.75000   -0.0579946
19   10.8   0.77889  0.781979    1   19   0.79167   -0.0096878
20   11.1   0.94530  0.827748    1   20   0.83333   -0.0055851
21   11.3   1.05624  0.854572    1   21   0.87500   -0.0204282
22   11.5   1.16718  0.878432    1   22   0.91667   -0.0382345
23   12.1   1.50001  0.933194    1   23   0.95833   -0.0251398
24   12.8   1.88830  0.970507    1   24   1.00000   -0.0294930
4) (Allen L. Webster) A regional manager wishes to examine the relationship between the amount spent on a given day, in dollars (y), and income in thousands of dollars (x1), gender (x2, a dummy variable that is 1 for female) and an interaction variable (x3), which is the product of the other two independent variables. The data is below (use α = .05). The last column was supposed to be computed by you!
Row  Expenditure  Income  Gender  Inter    x2·y
          y         x1      x2      x3
1        51         40       1      40      51
2        30         25       0       0       0
3        32         27       0       0       0
4        45         32       1      32      45
5        51         45       1      45      51
6        31         29       0       0       0
7        50         42       1      42      50
8        47         38       1      38      47
9        45         30       0       0       0
10       39         29       1      29      39
Sum                                        283
The following are given to help you: Σy = 421, Σy² = 18367, Σx1 = 337, Σx1² = 11793, Σx2 = 6, Σx2² = 6, Σx1y = 14655, Σx2y = ?, Σx1x2 = 226 and n = 10. You do not need all of these in this problem. [53]
a. Compute a simple regression of expenditure against x1 . (5)
b. Compute R 2 (4)
c. Compute s e (3)
d. Compute s b1 (the std. deviation of the slope) and do a confidence interval for  0 . On the basis of these
two exercises, are either or both significant? (4.5)
e. My Aunt Gertrude, who has an income of $25 thousand, has just walked into the store. Create an appropriate interval to predict her spending. (3.5) [20, 73]
The quantities below are given: Σy = 421, Σx1 = 337, n = 10, Σx1y = 14655, Σx2 = 6, Σy² = 18367, Σx1² = 11793, Σx2² = 6 and Σx1x2 = 226. a) You do not need all of these.
Spare Parts Computation:
Σx2y = 283,  Ȳ = 421/10 = 42.1,  X̄1 = 337/10 = 33.7,  X̄2 = 6/10 = 0.60
ΣX1² − nX̄1² = SSX1 = 11793 − 10(33.70)² = 436.10*
ΣX2² − nX̄2² = SSX2 = 6 − 10(0.60)² = 2.40*†
ΣY² − nȲ² = SST = SSY = 18367 − 10(42.1)² = 642.90*
ΣX1Y − nX̄1Ȳ = SX1Y = 14655 − 10(33.7)(42.1) = 467.30
ΣX2Y − nX̄2Ȳ = SX2Y = 283 − 10(0.6)(42.1) = 30.4†
ΣX1X2 − nX̄1X̄2 = SX1X2 = 226 − 10(33.7)(0.60) = 23.80†
*Must be positive. The rest may well be negative. †Needed only in the next problem.
Solution: The coefficients are b1 = Sxy/SSx = [ΣXY − nX̄Ȳ]/[ΣX² − nX̄²] = 467.30/436.1 = 1.0715 and b0 = Ȳ − b1X̄ = 42.1 − 1.0715(33.7) = 5.9905. So Ŷ = 5.9905 + 1.0715X.
b) Compute R² (4) Solution: SSR = b1 Sxy = 1.0715(467.3) = 500.7120. So
R² = SSR/SST = b1 Sxy/SSy = 500.7120/642.90 = .7788 or R² = (Sxy)²/(SSx·SSy) = 467.30²/[436.10(642.90)] = .7789.
c) Compute sₑ (3) Solution: sₑ² = SSE/(n−2) = (SST − SSR)/(n−2) = (SSy − b1 Sxy)/(n−2) = (642.9 − 500.7120)/8 = 17.7735.
sₑ = √17.7735 = 4.2159.
d) Compute s_b1 (the std. deviation of the slope) and do a confidence interval for β0. On the basis of these two exercises, are either or both significant? (4.5) Solution: We have n − 2 = 8 degrees of freedom, so we use t.025(8) = 2.306. Recall X̄1 = 33.7. To test H0: β0 = 0 against H1: β0 ≠ 0 compute
s_b0² = sₑ²[1/n + X̄²/SSx] = 17.7735[1/10 + 33.7²/436.1] = 17.7735(2.7042) = 48.0630. So s_b0 = √48.0630 = 6.9327. The confidence interval would be β0 = 5.9905 ± 2.306(6.9327) = 5.99 ± 15.99. Since the error part is larger than 5.99 in absolute value, this interval includes zero, so do not reject the null hypothesis.
To test H0: β1 = 0 against H1: β1 ≠ 0 compute s_b1² = sₑ²(1/SSx) = 17.7735/436.1 = 0.04076. So s_b1 = √0.04076 = 0.20188.
t = (b1 − 0)/s_b1 = 1.0715/0.20188 = 5.308. The 'reject' regions are below −t.025(8) = −2.306 and above t.025(8) = 2.306. Since our computed t ratio is in the upper 'reject' zone, reject the null hypothesis.
e) My Aunt Gertrude, who has an income of $25 thousand, has just walked into the store. Create an appropriate interval to predict her spending. (3.5) [20, 73] The appropriate interval is a prediction interval.
Solution: The Prediction Interval is Y0 = Ŷ0 ± t·sY, where sY² = sₑ²[(X0 − X̄)²/SSx + 1/n + 1] and Ŷ0 = b0 + b1X0. X0 = 25 and X̄1 = 33.7. We have already found Ŷ = 5.9905 + 1.0715X, sₑ² = 17.7735 and SSx = 436.10. So Ŷ0 = 5.9905 + 1.0715(25) = 32.778.
sY² = 17.7735[(25 − 33.7)²/436.10 + 1/10 + 1] = 17.7735(1.1 + 0.17356) = 22.6356 and sY = 4.7577. t.025(8) = 2.306, so Y0 = Ŷ0 ± t·sY = 32.778 ± 2.306(4.7577) = 32.78 ± 10.97.
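The whole simple-regression calculation can be checked with a short Python sketch built only from the given sums (my own illustration, not part of the exam):

```python
import math

# Simple-regression sketch rebuilding b0, b1, R-squared, s_e and the
# prediction interval from the "spare parts" sums (illustration only).
n = 10
sum_y, sum_y2 = 421, 18367
sum_x, sum_x2, sum_xy = 337, 11793, 14655

xbar, ybar = sum_x / n, sum_y / n
ss_x = sum_x2 - n * xbar ** 2            # 436.10
ss_y = sum_y2 - n * ybar ** 2            # 642.90
s_xy = sum_xy - n * xbar * ybar          # 467.30

b1 = s_xy / ss_x                         # about 1.0715
b0 = ybar - b1 * xbar                    # about 5.99
r_sq = s_xy ** 2 / (ss_x * ss_y)         # about .7789
se_sq = (ss_y - b1 * s_xy) / (n - 2)     # about 17.77

# Prediction interval for one new customer with income 25 ($ thousands)
x0 = 25
y_hat = b0 + b1 * x0                                             # about 32.78
s_y = math.sqrt(se_sq * (1 + 1 / n + (x0 - xbar) ** 2 / ss_x))   # about 4.76
margin = 2.306 * s_y                                             # about 10.97
```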
5) Data from the previous problem is repeated below. (Use   .05) .
Row  Expenditure  Income  Gender  Inter
          y         x1      x2      x3
1        51         40       1      40
2        30         25       0       0
3        32         27       0       0
4        45         32       1      32
5        51         45       1      45
6        31         29       0       0
7        50         42       1      42
8        47         38       1      38
9        45         30       0       0
10       39         29       1      29
The following are given to help you: Σy = 421, Σy² = 18367, Σx1 = 337, Σx1² = 11793, Σx2 = 6, Σx2² = 6, Σx1y = 14655, Σx2y = ?, Σx1x2 = 226 and n = 10.
a. Do a multiple regression of expenditure against x1 and x 2 . This involves a simultaneous equation
solution. Attempting to recycle b1 from the previous page won’t work. (12)
b. Compute R 2 and R 2 adjusted for degrees of freedom for both this and the previous problem. Compare
the values of R 2 adjusted between this and the previous problem. Use an F test to compare R 2 here with
the R 2 from the previous problem. The F test here is one to see if adding a new independent variable
improves the regression. (4)
c. Compute the regression sum of squares and use it in an ANOVA F test to test the usefulness of this
regression. (5)
d. My Aunt Gertrude, who has an income of $25 thousand, has just walked into the store. Use your
regression to predict how much she will spend. Do a confidence interval and a prediction interval. (4) [25,
98]
Solution: a) This is copied from the last problem. Spare Parts Computation:
Ȳ = 421/10 = 42.1,  X̄1 = 337/10 = 33.7,  X̄2 = 6/10 = 0.60
ΣX1² − nX̄1² = SSX1 = 11793 − 10(33.70)² = 436.10*
ΣX2² − nX̄2² = SSX2 = 6 − 10(0.60)² = 2.40*†
ΣY² − nȲ² = SST = SSY = 18367 − 10(42.1)² = 642.90*
ΣX1Y − nX̄1Ȳ = SX1Y = 14655 − 10(33.7)(42.1) = 467.30
ΣX2Y − nX̄2Ȳ = SX2Y = 283 − 10(0.6)(42.1) = 30.4†
ΣX1X2 − nX̄1X̄2 = SX1X2 = 226 − 10(33.7)(0.60) = 23.80†
*Must be positive. The rest may well be negative.
We substitute these numbers into the Simplified Normal Equations:
ΣX1Y − nX̄1Ȳ = b1(ΣX1² − nX̄1²) + b2(ΣX1X2 − nX̄1X̄2)
ΣX2Y − nX̄2Ȳ = b1(ΣX1X2 − nX̄1X̄2) + b2(ΣX2² − nX̄2²),
which are
467.30 = 436.10 b1 + 23.80 b2
 30.40 =  23.80 b1 +  2.40 b2,
and solve them as two equations in two unknowns for b1 and b2. These are a fairly tough pair of equations to solve until we notice that, if we multiply 2.40 by 23.80/2.40 = 9.916667, we get 23.80. We multiply the second equation by 9.916667 and the equations become
467.30    = 436.10 b1    + 23.80 b2
301.46667 = 236.01667 b1 + 23.80 b2.
If we subtract the second from the first, we get 165.83333 = 200.08333 b1. This means that b1 = 165.83333/200.08333 = 0.8288. Now remember that 30.40 = 23.80 b1 + 2.40 b2. This can be rewritten as 2.40 b2 = 30.40 − 23.80 b1. If we divide by 2.40, we get b2 = 12.6667 − 9.916667 b1. Let's substitute in b1 = 0.8288: b2 = 12.6667 − 9.916667(0.8288) = 12.6667 − 8.21945 = 4.4476. (It's worth checking your work by substituting your values of b1 and b2 back into the normal equations.) Finally we get b0 by using Ȳ = 42.1, X̄1 = 33.7 and X̄2 = 0.60 in b0 = Ȳ − b1X̄1 − b2X̄2 = 42.1 − 0.8288(33.7) − 4.4476(0.60) = 42.1 − 27.9333 − 2.6686 = 11.4981. Thus our equation is Ŷ = b0 + b1X1 + b2X2 = 11.4981 + 0.8288X1 + 4.4476X2.
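The same system can be solved by Cramer's rule, which makes a compact check (my own illustration; the exam solves it by elimination):

```python
# Sketch solving the two simplified normal equations by Cramer's rule
# (illustration only).
a11, a12, y1 = 436.10, 23.80, 467.30
a21, a22, y2 = 23.80, 2.40, 30.40

det = a11 * a22 - a12 * a21               # 480.2
b1 = (y1 * a22 - a12 * y2) / det          # about 0.8288
b2 = (a11 * y2 - y1 * a21) / det          # about 4.4476

ybar, x1bar, x2bar = 42.1, 33.7, 0.60
b0 = ybar - b1 * x1bar - b2 * x2bar       # about 11.50
```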
b) Compute R² and R² adjusted for degrees of freedom for both this and the previous problem. Compare the values of R² adjusted between this and the previous problem. Use an F test to compare R² here with the R² from the previous problem. The F test here is one to see if adding a new independent variable improves the regression. (4)
Solution: From the first regression we have SST = SSy = 642.90, R² = R²(Y.1) = 0.7788*, ΣX1Y − nX̄1Ȳ = SX1Y = 467.30, b1 = 1.0715 and SSR = b1 Sxy = 1.0715(467.3) = 500.7120.
From the second regression Ŷ = b0 + b1X1 + b2X2 = 11.4981 + 0.8288X1 + 4.4476X2 and ΣX2Y − nX̄2Ȳ = SX2Y = 30.4. This means that
SSR = b1 Sx1y + b2 Sx2y = 0.8288(467.30) + 4.4476(30.4) = 387.2982 + 135.2070 = 522.5052.
R² = R²(Y.12) = SSR/SST = 522.5052/642.90 = 0.8127*. If we use R̄², which is R² adjusted for degrees of freedom, for the first regression the number of independent variables was k = 1 and
R̄² = [(n−1)R² − k]/(n − k − 1) = [9(0.7788) − 1]/8 = .7511,
and for the second regression k = 2 and
R̄² = [9(0.8127) − 2]/7 = .7592.
R-squared adjusted is supposed to rise if our new variable has any explanatory power. Note: *These numbers must be positive. The rest may well be negative.
There are two ways to do the F test. We can use the second regression to give us SSE = SST − SSR2 = 642.90 − 522.5052 = 120.3948. In the second regression, the explained sum of squares rises by 522.5052 − 500.7120 = 21.7932. We can make an ANOVA table for looking at a new variable as follows. Assume that we have SSR1 for the first regression on k independent variables and add r new independent variables and get a new SSR2.

Source            SS           DF       MS    Fcalc     F
First Regression  SSR1         k        MSR1  MSR1/MSE  F(k, n−k−r−1)
2nd Regression    SSR2 − SSR1  r        MSR2  MSR2/MSE  F(r, n−k−r−1)
Error             SSE          n−k−r−1  MSE
Total             SST          n−1

Source            SS        DF  MS        Fcalc   F
First Regression  500.7120   1  500.7120  29.11s  F.05(1,7) = 5.59
2nd Regression     21.7932   1   21.7932   1.26ns F.05(1,7) = 5.59
Error             120.3948   7   17.1993
Total             642.90     9
We can get the same results using R². Remember R²(Y.12) = 0.8127 and R²(Y.1) = 0.7788.

Source            SS                                             DF  MS      Fcalc   F
First Regression  R²(Y.1) = 0.7788                                1  .7788   29.10s  F.05(1,7) = 5.59
2nd Regression    R²(Y.12) − R²(Y.1) = 0.8127 − 0.7788 = .0339    1  .0339    1.27ns F.05(1,7) = 5.59
Error             1 − R²(Y.12) = 1 − 0.8127 = .1873               7  .02676
Total             1.000 (column must add to 1.000)                9

Note that this seems to show that the second independent variable made no significant improvement in the amount of variation in Y explained by the independent variables.
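Both F ratios can be confirmed with a short Python sketch from the sums of squares (my own illustration, not part of the exam):

```python
# Sketch of the F test for the added variable and for the whole
# two-variable regression (illustration only).
sst = 642.90
ssr1 = 500.7120        # regression SS with Income only
ssr2 = 522.5052        # regression SS with Income and Gender
n, k, r = 10, 1, 1     # k old predictors, r newly added

sse = sst - ssr2                            # 120.3948
mse = sse / (n - k - r - 1)                 # about 17.20 on 7 df
f_added = ((ssr2 - ssr1) / r) / mse         # about 1.27
f_full = (ssr2 / (k + r)) / mse             # about 15.19

# F.05(1,7) = 5.59 and F.05(2,7) = 4.74
added_significant = f_added > 5.59          # False
regression_significant = f_full > 4.74      # True
```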
c) Compute the regression sum of squares and use it in an ANOVA F test to test the usefulness of this regression. (5) We already did this in b). Remember SSE = SST − SSR2 = 642.90 − 522.5052 = 120.3948.

Source                    SS        DF  MS      Fcalc   F
First and 2nd Regression  522.5052   2  261.25  15.19s  F.05(2,7) = 4.74
Error                     120.3948   7   17.20
Total                     642.90     9

The null hypothesis is no connection between Y and the X's. It is rejected because the calculated F is above the table value.
d) My Aunt Gertrude, who has an income of $25 thousand, has just walked into the store. Use your regression to predict how much she will spend. Do a confidence interval and a prediction interval.
Solution: From the second regression Ŷ = b0 + b1X1 + b2X2 = 11.4981 + 0.8288X1 + 4.4476X2. The problem says X1 = 25, and, since Aunt Gertrude is female, X2 = 1.
Ŷ = 11.4981 + 0.8288(25) + 4.4476(1) = 11.4981 + 20.7200 + 4.4476 = 36.67
The error mean square in the ANOVA is 17.20 and has 7 degrees of freedom. So use t.05(7) = 1.895 and sₑ = √17.20 = 4.1473. The confidence interval is an interval for an average value of Y when the independent variables are 25 and 1, but there is nothing average about my Aunt Gert, so the prediction interval is more appropriate than the confidence interval.
The outline says that an approximate confidence interval is μ(Y0) = Ŷ0 ± t·sₑ/√n = 36.67 ± 1.895(4.1473/√10) = 36.67 ± 2.49 and an approximate prediction interval is Y0 = Ŷ0 ± t·sₑ = 36.67 ± 1.895(4.1473) = 36.67 ± 7.86.
The regression output from Minitab follows.

MTB > Regress c1 1 c2;
SUBC>   Constant;
SUBC>   Brief 3.

Regression Analysis: Expend versus Income

The regression equation is
Expend = 5.99 + 1.07 Income

Predictor    Coef  SE Coef     T      P
Constant    5.989    6.932  0.86  0.413
Income     1.0715   0.2019  5.31  0.001

S = 4.21556   R-Sq = 77.9%   R-Sq(adj) = 75.1%
Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  500.73  500.73  28.18  0.001
Residual Error   8  142.17   17.77
Total            9  642.90

Obs  Income  Expend    Fit  SE Fit  Residual  St Resid
1      40.0   51.00  48.85    1.84      2.15      0.57
2      25.0   30.00  32.78    2.20     -2.78     -0.77
3      27.0   32.00  34.92    1.90     -2.92     -0.78
4      32.0   45.00  40.28    1.38      4.72      1.19
5      45.0   51.00  54.21    2.64     -3.21     -0.98
6      29.0   31.00  37.06    1.64     -6.06     -1.56
7      42.0   50.00  50.99    2.14     -0.99     -0.27
8      38.0   47.00  46.71    1.59      0.29      0.07
9      30.0   45.00  38.14    1.53      6.86      1.75
10     29.0   39.00  37.06    1.64      1.94      0.50
MTB > Regress c1 2 c2 c3;
SUBC>   Constant;
SUBC>   Brief 3.

Regression Analysis: Expend versus Income, Gender

The regression equation is
Expend = 11.5 + 0.829 Income + 4.45 Gender

Predictor    Coef  SE Coef     T      P
Constant   11.500    8.396  1.37  0.213
Income     0.8288   0.2932  2.83  0.026
Gender      4.448    3.952  1.13  0.298

S = 4.14707   R-Sq = 81.3%   R-Sq(adj) = 75.9%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       2  522.51  261.26  15.19  0.003
Residual Error   7  120.39   17.20
Total            9  642.90

Source  DF  Seq SS
Income   1  500.73
Gender   1   21.78
Obs  Income  Expend    Fit  SE Fit  Residual  St Resid
1      40.0   51.00  49.10    1.83      1.90      0.51
2      25.0   30.00  32.22    2.22     -2.22     -0.63
3      27.0   32.00  33.88    2.09     -1.88     -0.52
4      32.0   45.00  42.47    2.37      2.53      0.74
5      45.0   51.00  53.24    2.74     -2.24     -0.72
6      29.0   31.00  35.54    2.11     -4.54     -1.27
7      42.0   50.00  50.76    2.12     -0.76     -0.21
8      38.0   47.00  47.44    1.70     -0.44     -0.12
9      30.0   45.00  36.36    2.18      8.64      2.45R
10     29.0   39.00  39.98    3.05     -0.98     -0.35

R denotes an observation with a large standardized residual.
6) Data from the previous problem is repeated below. (Use   .05) .
Row  Expenditure  Income  Gender  Inter
          y         x1      x2      x3
1        51         40       1      40
2        30         25       0       0
3        32         27       0       0
4        45         32       1      32
5        51         45       1      45
6        31         29       0       0
7        50         42       1      42
8        47         38       1      38
9        45         30       0       0
10       39         29       1      29
Part of the printout from the last regression follows.

Regression Analysis: Expend versus Income, Gender, Inter

The regression equation is
Expend = - 28.5 + 2.27 Income + 48.8 Gender - 1.56 Inter

Predictor     Coef  SE Coef      T      P
Constant    -28.53    27.62  -1.03  0.342
Income      2.2712   0.9930   2.29  0.062
Gender       48.80    29.61   1.65  0.150
Inter       -1.557    1.032  -1.51  0.182

S = 3.81355   R-Sq = 86.4%   R-Sq(adj) = 79.6%
a) Check the above printout i) to see if adding the interaction variable improved our regression and
ii) to compute a partial correlation between interaction and expenditure. Explain the meaning of
the partial correlation. (2.5)
b) Use the R-squared here and the R-squared in your first regression to do an F-test to show if the
gender and interaction variables together accomplished anything. (2)
c) Compute the sample correlation between income and expenditure and test it for significance.
(4.5)
d) Test the same correlation to see if it is .9. (5)
e) Compute the rank correlation between income and expenditure and test it for significance.
Please do not forget to rank the variables first. (5) [19, 117]
Solution: a) Check the above printout i) to see if adding the interaction variable improved our regression and ii) to compute a partial correlation between interaction and expenditure. Explain the meaning of the partial correlation. (2.5)
i) Well, R-squared adjusted rose, but the fact that none of the coefficients are significant at the 5% level indicates that we have accomplished very little.
ii) The outline says r²Y3.12 = t3²/(t3² + df) = (-1.51)²/((-1.51)² + 6) = 2.2801/8.2801 = .2754. This indicates that after Income and Gender are accounted for, there is a relatively weak linear relationship between Interaction and Expenditure.
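Both routes to this squared partial correlation — the t-ratio formula and the difference of the two R-squared values — can be checked numerically (a Python sketch using figures from the two printouts):

```python
# Route 1: from the t ratio of Inter and the residual degrees of freedom
t3, df = -1.51, 6
r_sq_partial = t3 ** 2 / (t3 ** 2 + df)

# Route 2: from R-Sq without and with Inter
r2_reduced, r2_full = 0.813, 0.864
via_r2 = (r2_full - r2_reduced) / (1 - r2_reduced)

print(round(r_sq_partial, 4), round(via_r2, 4))  # both about .27-.28
```

The two routes agree up to the rounding in the printed t ratio and R-squared values.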
b) Use the R-squared here and the R-squared in your first regression to do an F-test to show if the
gender and interaction variables together accomplished anything. (2)
n  10
First regression: k  1 ; R 2  .7788 R 2  .7511 .
This regression: k  3 ; R 2  .864 R 2  .796
Source
SS
DF
MS
RY2.1  0.7788
1
.7788
2nd & 3rd Regression RY2.123  RY2.1  0.864  0.7788  .085
2
.0425
6
.0227
First Regression
Error
Total
1  RY2.123  1  0.864  .136
Fcalc
F
1,6   5.99
34.31s F.05
2,6   5.14
1.87ns F.05
Column must add to 1.000
9
No surprise! Although the adjusted R-squared rose, the insignificant F means that both variables
together do little.
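The same F ratio can be computed directly from the two R-squared values, since dividing every SS by the total SS puts the ANOVA table into the R-squared units used above (Python sketch):

```python
# F test for the two added regressors (Gender and Inter) beyond Income
r2_reduced, r2_full = 0.7788, 0.864
n, k_full, m_added = 10, 3, 2

f_calc = ((r2_full - r2_reduced) / m_added) / ((1 - r2_full) / (n - k_full - 1))
print(round(f_calc, 2))  # about 1.88, well below F.05(2,6) = 5.14
```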
c) Compute the sample correlation between income and expenditure and test it for significance. (4.5)
We already know that R² = .7788 and n = 10, so df = n - 2 = 8 for the simple regression. Since the regression coefficient is positive, we can say r = +√.7788 = .8825.
Test H0: ρxy = 0 against H1: ρxy ≠ 0. The test of the null hypothesis ρ = 0 is
t(n-2) = r/sr, where sr = √((1 - r²)/(n - 2)), so
t = .8825/√((1 - .7788)/8) = .8825/0.16628 = 5.31.
The 'reject' regions are below -t.025(8) = -2.306 and above t.025(8) = 2.306. Since our computed t ratio is in the upper 'reject' zone, reject the null hypothesis.
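A numeric check of this t ratio (Python sketch):

```python
from math import sqrt

# t test of H0: rho = 0 for the income-expenditure correlation
r_sq, n = 0.7788, 10
r = sqrt(r_sq)                       # positive because the slope is positive
t = r / sqrt((1 - r_sq) / (n - 2))
print(round(r, 4), round(t, 2))      # r about .8825, t about 5.31 > 2.306
```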
d) Test the same correlation to see if it is .9. (5)
Test H0: ρxy = 0.9 against H1: ρxy ≠ 0.9 when n = 10, r = .8825 and r² = .7788 (α = .05).
This time compute Fisher's z-transformation (because ρ0 is not zero):
z̃ = ½ ln((1 + r)/(1 - r)) = ½ ln(1.8825/0.1175) = ½ ln(16.02128) = ½(2.77392) = 1.38696
μz = ½ ln((1 + ρ0)/(1 - ρ0)) = ½ ln(1.9/0.1) = ½ ln(19.0000) = ½(2.94444) = 1.47222
sz = 1/√(n - 3) = 1/√7 = 0.37796
Finally t = (z̃ - μz)/sz = (1.38696 - 1.47222)/0.37796 = -0.226.
Compare this with ±t.025(n-2) = ±2.306. Since -0.226 lies between these two values, do not reject the null hypothesis.
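The Fisher z computation can be checked in a few lines (Python sketch):

```python
from math import log, sqrt

# Fisher z test of H0: rho = 0.9 against a two-sided alternative
r, rho0, n = 0.8825, 0.9, 10
z_tilde = 0.5 * log((1 + r) / (1 - r))      # transformed sample correlation
mu_z = 0.5 * log((1 + rho0) / (1 - rho0))   # transformed null value
s_z = 1 / sqrt(n - 3)
t = (z_tilde - mu_z) / s_z
print(round(z_tilde, 5), round(t, 3))       # about 1.38696 and -0.226
```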
e) Compute the rank correlation between income and expenditure and test it for significance.
Please do not forget to rank the variables first. (5) [19, 117] The original data is printed below in columns (1) and (2). The expenditure numbers are ranked in (3) and the income numbers are ranked in (4). d is the difference between the ranks, which is squared in d². Note that the rank columns must each add to n(n + 1)/2 = 10(11)/2 = 55, and that the d column must add to zero.
     (1)          (2)      (3)               (4)           (5)    (6)
Row  Expenditure  Income   Rank Expenditure  Rank Income     d     d²
 1        51        40         9.5               8.0        1.5   2.25
 2        30        25         1.0               1.0        0.0   0.00
 3        32        27         3.0               2.0        1.0   1.00
 4        45        32         5.5               6.0       -0.5   0.25
 5        51        45         9.5              10.0       -0.5   0.25
 6        31        29         2.0               3.5       -1.5   2.25
 7        50        42         8.0               9.0       -1.0   1.00
 8        47        38         7.0               7.0        0.0   0.00
 9        45        30         5.5               5.0        0.5   0.25
10        39        29         4.0               3.5        0.5   0.25
Sum                           55.0              55.0        0.0   7.50
A correlation coefficient between the ranks, rx and ry, can be computed as in c) above, but it is easier to compute d = rx - ry, and then rs = 1 - 6Σd²/(n(n² - 1)) = 1 - 6(7.50)/(10(100 - 1)) = 1 - .0455 = .9545. This can be given a t test for H0: ρs = 0 as in c) above, but for n between 4 and 30, a special table should be used. Part of the table from the Supplement is repeated below. We do not reject the null hypothesis of no rank correlation if rs is between ±.6364. Since our value of the rank correlation is above +.6364, reject the null hypothesis.
 n   α = .100  α = .050  α = .025  α = .010  α = .005  α = .001
 4     .8000     .8000
 5     .7000     .8000     .9000     .9000
 6     .6000     .7714     .8286     .8857     .9429
 7     .5357     .6786     .7450     .8571     .8929     .9643
 8     .5000     .6190     .7143     .8095     .8571     .9286
 9     .4667     .5833     .6833     .7667     .8167     .9000
10     .4424     .5515     .6364     .7333     .7818     .8667
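The Σd² and rs figures above can be reproduced numerically. In the sketch below (Python), `ranks` is a hypothetical helper implementing the average-rank treatment of ties, not something from the exam:

```python
def ranks(values):
    # Average-rank method: tied values share the mean of their positions
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

expend = [51, 30, 32, 45, 51, 31, 50, 47, 45, 39]
income = [40, 25, 27, 32, 45, 29, 42, 38, 30, 29]
rx, ry = ranks(expend), ranks(income)

n = len(expend)
d_sq = sum((a - b) ** 2 for a, b in zip(rx, ry))
r_s = 1 - 6 * d_sq / (n * (n ** 2 - 1))
print(d_sq, round(r_s, 4))  # 7.5 and 0.9545
```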
7) The following are entirely tests of proportions. Samples of senior executives were polled on three
separate dates as to the business outlook. Please test the following using a 99% confidence level.
a) In Year 1 a third of executives were unsure or predicted no change. (2)
b) The proportion that predicted no change or were undecided did not change between Year 1 and
Year 2. Do this problem using a confidence interval, a test ratio and a critical value (5) or just by
using one of these three methods. (3)
c) In b) do a 97% confidence interval for the difference between the proportions that predicted no
change. Do not use a value from the t-table to do this. (2)
d.) The executives' outlook did not change over time (5). [14, 131]
Outlook    Year 1  Year 2  Year 3  Total
Go Up         152     177     101    430
Go Down       104      72      36    212
? or Same     144     152     265    561
Total         400     401     402   1203
Solution: a) In Year 1 a third of executives were unsure or predicted no change. (2)
From Table 4 of the syllabus supplement, we have the following.

Interval for  Confidence Interval     Hypotheses     Test Ratio            Critical Value
Proportion    p = p̂ ± z(α/2) sp      H0: p = p0     z = (p̂ - p0)/σp      p̂cv = p0 ± z(α/2) σp
              sp = √(p̂q̂/n)          H1: p ≠ p0     σp = √(p0q0/n)
              q̂ = 1 - p̂                            q0 = 1 - p0
H0: p = 1/3 against H1: p ≠ 1/3. α = .01, z(α/2) = z.005 = 2.576 and p̂ = 144/400 = .360. Note that a few people told me that they either did or didn't think that .36 was close to 1/3. That's a good way to make me suspect that you haven't learned anything in this course.
Test Ratio Method: σp = √(p0q0/n) = √((1/3)(2/3)/400) = 0.02357 and p̂ = .360.
z = (p̂ - p0)/σp = (.360 - 1/3)/.02357 = 1.13137. This is between ±2.576, so do not reject H0.
Critical Value Method: p̂cv = p0 ± z(α/2) σp = 1/3 ± 2.576(0.02357) = .3333 ± .0607. Do not reject the null hypothesis if p̂ is between .273 and .394. This interval includes .36, so do not reject H0.
Confidence Interval Method: sp = √(p̂q̂/n) = √(.360(.640)/400) = 0.02400.
p = p̂ ± z(α/2) sp = .36 ± 2.576(.024) = .36 ± .062 or .298 to .422. Since this interval includes 1/3, do not reject H0.
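A numeric check of the test ratio (Python sketch):

```python
from math import sqrt

# One-sample proportion test of H0: p = 1/3 with n = 400, x = 144
p0, n, x = 1 / 3, 400, 144
p_hat = x / n                          # .360
sigma_p = sqrt(p0 * (1 - p0) / n)      # built from p0, not p_hat
z = (p_hat - p0) / sigma_p
print(round(sigma_p, 5), round(z, 3))  # about 0.02357 and 1.131, inside ±2.576
```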
b) The proportion that predicted no change or were undecided did not change between Year 1 and
Year 2. Do this problem using a confidence interval, a test ratio and a critical value (5) or just by
using one of these three methods. (3)
From Table 4 of the syllabus supplement, we have the following.

Interval for  Confidence Interval      Hypotheses         Test Ratio                         Critical Value
Difference    Δp = Δp̂ ± z(α/2) sΔp    H0: Δp = Δp0       z = (Δp̂ - Δp0)/σΔp               Δp̂cv = Δp0 ± z(α/2) σΔp
between       Δp̂ = p̂1 - p̂2          H1: Δp ≠ Δp0       If Δp0 = 0:
proportions   sΔp = √(p̂1q̂1/n1        Δp = p1 - p2         σΔp = √(p̄0q̄0(1/n1 + 1/n2))
q = 1 - p           + p̂2q̂2/n2)       Δp0 = p01 - p02      p̄0 = (n1p̂1 + n2p̂2)/(n1 + n2)
                                                          If Δp0 ≠ 0:
                                                            σΔp = √(p01q01/n1 + p02q02/n2)
                                                          or use sΔp
H0: p1 = p2 or Δp = 0 against H1: p1 ≠ p2 or Δp ≠ 0. α = .01, z(α/2) = z.005 = 2.576,
p̂1 = 144/400 = .3600 and p̂2 = 152/401 = .3791.
Test Ratio Method: p̄0 = (n1p̂1 + n2p̂2)/(n1 + n2) = (144 + 152)/(400 + 401) = .3695.
σΔp = √(p̄0q̄0(1/n1 + 1/n2)) = √(.3695(.6305)(1/400 + 1/401)) = √.00116 = .03413.
Δp̂ = p̂1 - p̂2 = 144/400 - 152/401 = .3600 - .3791 = -.0191.
z = (Δp̂ - Δp0)/σΔp = (-.0191 - 0)/.03413 = -0.5596. This is between ±2.576, so do not reject H0.
Critical Value Method: Δp̂cv = Δp0 ± z(α/2) σΔp = 0 ± 2.576(0.03413) = ±0.0879. This interval includes -.0191 so do not reject H0.
Confidence Interval Method: sΔp = √(p̂1q̂1/n1 + p̂2q̂2/n2) = √(.3600(.6400)/400 + .3791(.6209)/401)
= √(.00057600 + .00058699) = √.00116299 = .034103.
So Δp = Δp̂ ± z(α/2) sΔp = -.0191 ± 2.576(.034103) = -.0191 ± .08784. Since the error part of the interval is larger in absolute value than -.0191, the interval includes zero. So do not reject H0.
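A numeric check of the pooled test ratio (Python sketch; carried at full precision it comes out near -0.559, a hair from the rounded hand value):

```python
from math import sqrt

# Pooled two-proportion z test of H0: p1 = p2
x1, n1, x2, n2 = 144, 400, 152, 401
p1, p2 = x1 / n1, x2 / n2
p_bar = (x1 + x2) / (n1 + n2)                             # pooled proportion
sigma_dp = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
z = (p1 - p2) / sigma_dp
print(round(p_bar, 4), round(z, 3))  # about 0.3695 and -0.559
```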
c) In b) do a 97% confidence interval for the difference between the proportions that predicted no
change. Do not use a value from the t-table to do this. (2)
α = .03, z(α/2) = z.015 and sΔp = .034103. Make a diagram! Show a Normal curve for z with a center at zero. By definition z.015 is the point with 1.5% above it. Since the probability below zero is .5, the diagram should show P(0 ≤ z ≤ z.015) = .5 - .0150 = .4850. According to the Normal table P(0 ≤ z ≤ 2.17) = .4850, so z.015 = 2.17. If we wish to check this, P(z ≥ 2.17) = P(z ≥ 0) - P(0 ≤ z ≤ 2.17) = .5 - .4850 = .0150. So our interval is Δp = Δp̂ ± z(α/2) sΔp = -.0191 ± 2.17(.034103) = -.0191 ± .0740 or -.093 to .055.
d.) The executives' outlook did not change over time (5). [14, 131] A test of multiple proportions is
a Chi-square Test of Homogeneity. H 0 is homogeneity. The given table is repeated below with
row sums and the fraction in each row added.
Outlook    Year 1  Year 2  Year 3  Total     pr
Go Up         152     177     101    430   .3574
Go Down       104      72      36    212   .1762
? or Same     144     152     265    561   .4663
Total         400     401     402   1203   .9999

pr is the fraction of the total of 1203 in each row.
Our observed data can be displayed as

O      Year 1  Year 2  Year 3  Total
Up        152     177     101    430
Down      104      72      36    212
Same      144     152     265    561
Total     400     401     402   1203

and our expected values are

E      Year 1  Year 2  Year 3    Total
Up     142.98  143.33  143.69   430.00
Down    70.49   70.67   70.84   212.00
Same   186.53  187.00  187.47   561.00
Total  400.00  401.00  402.00  1203.00

where 142.98, for example, is .3574(400). (.3574 is the fraction of the data that was observed in the 'Up' row and 400 was the sum of the 'Year 1' column.)
χ² computations are done two ways below, pairing each observed count with the expected count for the same cell. Degrees of freedom are (r - 1)(c - 1) = (2)(2) = 4.

   O        E     O²/E     O - E   (O - E)²/E
 152   142.98   161.59     9.024       0.570
 177   143.33   218.58    33.667       7.908
 101   143.69    70.99   -42.691      12.684
 104    70.49   153.44    33.510      15.929
  72    70.67    73.35     1.333       0.025
  36    70.84    18.30   -34.843      17.137
 144   186.53   111.16   -42.534       9.699
 152   187.00   123.55   -35.000       6.551
 265   187.47   374.59    77.534      32.067
1203  1203.00  1305.55     0.000     102.570

χ² = ΣO²/E - n = 1305.55 - 1203 = 102.55 (equal to Σ(O - E)²/E up to rounding) and χ².05(4) = 9.4877. So, since our computed chi-square is larger than the chi-square from the table, reject H0 and conclude that the executives' outlook did change over time.
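The chi-square statistic can be rechecked with a short sketch (Python) that rebuilds the expected table from the margins and pairs each cell's O with its own E:

```python
# Chi-square test of homogeneity for the 3x3 outlook table
obs = [[152, 177, 101],
       [104,  72,  36],
       [144, 152, 265]]
row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]
n = sum(row)

chi_sq = sum((obs[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
             for i in range(3) for j in range(3))
df = (len(obs) - 1) * (len(obs[0]) - 1)
print(round(chi_sq, 2), df)  # about 102.57 on 4 degrees of freedom
```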