252y0781 12/11/07 KEY

ECO252 QBA2
Final Exam
December 12-15, 2007
Version 1
Name and Class hour:__________KEY_______
I. (25+ points) Do all the following. Note that answers without reasons and/or citation of appropriate
statistical tests receive no credit. Most answers require a statistical test, that is, stating or implying a
hypothesis and showing why it is true or false by citing a table value or a p-value. If you haven’t done it
lately, take a fast look at ECO 252 - Things That You Should Never Do on a Statistics Exam (or Anywhere
Else). There are over 150 possible points, but the exam is normed on 75 points.
In their 2000 text, the Lees noted that before 1979 the Federal Reserve targeted interest rates, letting the
money supply grow in such a way that interest rates would remain stable. After 1979, the Fed switched
to targeting the money supply. The Lees did a regression of the money supply against GNP (I had to replace
this with GDP), the prime rate (PrRt) and a dummy variable (Dummy) that is 1 before 1979 and zero from
1979 until 1990, when their analysis stops. They report a high R-squared and extremely significant
coefficients for the prime rate, GNP and the dummy variable, which seems to tell us that the Fed's change
of regime had a real effect on the money supply. Later in the text they suggest the addition of an interaction
variable (GDPPr), which is the product of the prime rate and the GDP, and a second interaction variable
(GDPdum), the product of GDP and the dummy. I added the year and its square measured from 1958,
population, and GDP squared. My attempt to update the Lees' results was terribly discouraging. The
dependent variable is M1 or its logarithm (logM1).
————— 12/3/2007 11:31:46 PM ————————————————————
Welcome to Minitab, press F1 for help.
MTB > WOpen "C:\Documents and Settings\RBOVE\My Documents\Minitab\M1PrRGDP.MTW".
Retrieving worksheet from file: 'C:\Documents and Settings\RBOVE\My
Documents\Minitab\M1PrRGDP.MTW'
Worksheet was saved on Mon Dec 03 2007
MTB > print c5 c2 c4 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15
Data Display

Row   C5      M1   PrRt         GDP  Dummy   GDPPr  GDPdum  year  yearsq     Pop      GDPsq   log M1   logM1l
  1  1959   140.0   4.50     $506.60     1    2280   506.6     1       1  176289     256644  4.94164  4.89222
  2  1960   140.7   5.00     $526.40     1    2632   526.4     2       4  179979     277097  4.94663  4.94164
  3  1961   145.2   4.50     $544.70     1    2451   544.7     3       9  182992     296698  4.97811  4.94663
  4  1962   147.8   4.50     $585.60     1    2635   585.6     4      16  185771     342927  4.99586  4.97811
  5  1963   153.3   4.50     $617.70     1    2780   617.7     5      25  188483     381553  5.03240  4.99586
  6  1964   160.3   4.50     $663.60     1    2986   663.6     6      36  191141     440365  5.07705  5.03240
  7  1965   167.8   4.50     $719.10     1    3236   719.1     7      49  193526     517105  5.12277  5.07705
  8  1966   172.0   5.52     $787.80     1    4349   787.8     8      64  195576     620629  5.14749  5.12277
  9  1967   183.3   5.50     $832.60     1    4579   832.6     9      81  197457     693223  5.21112  5.14749
 10  1968   197.4   6.50     $910.00     1    5915   910.0    10     100  199399     828100  5.28523  5.21112
 11  1969   203.9   8.23     $984.60     1    8103   984.6    11     121  201385     969437  5.31763  5.28523
 12  1970   214.4   8.00   $1,038.50     1    8308  1038.5    12     144  203984    1078482  5.36784  5.31763
 13  1971   228.3   5.50   $1,127.10     1    6199  1127.1    13     169  206827    1270354  5.43066  5.36784
 14  1972   249.2   5.04   $1,238.30     1    6241  1238.3    14     196  209284    1533387  5.51826  5.43066
 15  1973   262.9   7.49   $1,382.70     1   10356  1382.7    15     225  211357    1911859  5.57177  5.51826
 16  1974   274.2  11.54   $1,500.00     1   17310  1500.0    16     256  213342    2250000  5.61386  5.57177
 17  1975   287.1   7.07   $1,638.30     1   11583  1638.3    17     289  215465    2684027  5.65983  5.61386
 18  1976   306.2   7.20   $1,825.30     1   13142  1825.3    18     324  217583    3331720  5.72424  5.65983
 19  1977   330.9   6.75   $2,030.90     1   13709  2030.9    19     361  219760    4124555  5.80182  5.72424
 20  1978   357.3   8.63   $2,294.70     1   19803  2294.7    20     400  222095    5265648  5.87858  5.80182
 21  1979   381.8  11.65   $2,563.30     0   29862     0.0    21     441  224567    6570507  5.94490  5.87858
 22  1980   408.5  12.63   $2,789.50     0   35231     0.0    22     484  227225    7781310  6.01249  5.94490
 23  1981   436.7  20.03   $3,128.40     0   62662     0.0    23     529  229466    9786887  6.07925  6.01249
 24  1982   474.8  16.50   $3,255.00     0   53708     0.0    24     576  231664   10595025  6.16289  6.07925
 25  1983   521.4  10.50   $3,536.70     0   37135     0.0    25     625  233792   12508247  6.25652  6.16289
 26  1984   551.6  12.60   $3,933.20     0   49558     0.0    26     676  235825   15470062  6.31282  6.25652
 27  1985   619.8   9.78   $4,220.30     0   41275     0.0    27     729  237924   17810932  6.42940  6.31282
 28  1986   724.7   8.50   $4,462.80     0   37934     0.0    28     784  240133   19916584  6.58576  6.42940
 29  1987   750.2   8.25   $4,739.50     0   39101     0.0    29     841  242289   22462860  6.62034  6.58576
 30  1988   786.7   9.00   $5,103.80     0   45934     0.0    30     900  244499   26048774  6.66785  6.62034
 31  1989   792.9  11.07   $5,484.40     0   60712     0.0    31     961  246819   30078643  6.67570  6.66785
 32  1990   824.7  10.00   $5,803.10     0   58031     0.0    32    1024  249623   33675970  6.71502  6.67570
 33  1991   896.9   8.50   $5,995.90     0   50965     0.0    33    1089  252981   35950817  6.79894  6.71502
 34  1992  1024.8   6.50   $6,337.70     0   41195     0.0    34    1156  256514   40166441  6.93225  6.79894
 35  1993  1129.7   6.00   $6,657.40     0   39944     0.0    35    1225  259919   44320975  7.02971  6.93225
 36  1994  1150.7   7.25   $7,072.20     0   51273     0.0    36    1296  263126   50016013  7.04813  7.02971
 37  1995  1127.4   9.00   $7,397.70     0   66579     0.0    37    1369  266278   54725965  7.02767  7.04813
 38  1996  1081.4   8.25   $7,816.90     0   64489     0.0    38    1444  269394   61103926  6.98601  7.02767
 39  1997  1072.8   8.50   $8,304.30     0   70587     0.0    39    1521  272647   68961398  6.97803  6.98601
 40  1998  1095.9   8.50   $8,747.00     0   74350     0.0    40    1600  275854   76510009  6.99933  6.97803
 41  1999  1123.0   7.75   $9,268.40     0   71830     0.0    41    1681  279040   85903239  7.02376  6.99933
 42  2000  1087.7   9.50   $9,817.00     0   93262     0.0    42    1764  282217   96373489  6.99182  7.02376
 43  2001  1182.0   6.98  $10,128.00     0   70693     0.0    43    1849  285226  102576384  7.07496  6.99182
 44  2002  1219.5   4.75  $10,469.60     0   49731     0.0    44    1936  288126  109612524  7.10620  7.07496
 45  2003  1305.5   4.22  $10,960.80     0   46255     0.0    45    2025  290796  120139137  7.17434  7.10620
 46  2004  1375.2   4.01  $11,685.90     0   46860     0.0    46    2116  293638  136560259  7.22635  7.17434
 47  2005  1373.2   6.01  $12,433.90     0   74728     0.0    47    2209  296507  154601869  7.22490  7.22635
 48  2006  1365.9   8.02  $13,194.70     0  105821     0.0    48    2304  299398  174100108  7.21957  7.22490
I followed the course suggested by the textbook to find what variables were actually important in predicting
the money supply. The following is quoted from an earlier final exam solution.
The best subsets approach, according to Berenson et al., involves: (i) choosing a large
set of candidate independent variables; (ii) running a regression with all the candidate
variables and using the VIF option, which tests for collinearity; (iii) eliminating
variables with a VIF over 5; (iv) continuing to run regressions and eliminate candidate
variables until there are no variables with a VIF over 5; (v) performing a best-subsets
regression on the model without high VIFs and computing Cp; (vi) shortlisting the
models with a Cp less than or close to k + 1, where k is the number of independent
variables in that regression; (vii) choosing from the shortlist on the basis of things like
significance of coefficients and R-squared; (viii) using residual analysis and influence
analysis to further refine the model by adding nonlinear terms, transforming variables
and eliminating suspicious observations. Note that terms like a squared term are largely
exempt from the VIF rules, even if they are correlated with the untransformed variable.
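(A quick cross-check of step (ii): VIFs are easy to compute outside Minitab. Below is a minimal Python/numpy sketch; the array X and the function name are illustrative assumptions, not Minitab output. Each VIF is 1/(1 - R²) from regressing one candidate column on all the others.)

import numpy as np

def vif(X):
    """VIF for each column of the n-by-k predictor array X."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        # regress column j on an intercept plus all the other columns
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r2 = 1 - ((y - Z @ b) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return out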
Results for: M1PrRGDP.MTW
MTB > Regress c2 5 c4 c6 c7 c10 c12;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.
Regression 1
Regression Analysis: M1 versus PrRt, GDP, Dummy, year, Pop
The regression equation is
M1 = 2874 - 19.1 PrRt + 0.0714 GDP - 115 Dummy + 46.2 year - 0.0149 Pop
Predictor       Coef   SE Coef      T      P      VIF
Constant        2874      1232   2.33  0.025
PrRt         -19.116     3.941  -4.85  0.000    2.241
GDP          0.07138   0.01762   4.05  0.000   62.461
Dummy        -114.81     48.62  -2.36  0.023    8.260
year           46.23     15.57   2.97  0.005  668.523
Pop        -0.014888  0.007176  -2.07  0.044  917.418

S = 57.7863   R-Sq = 98.4%   R-Sq(adj) = 98.2%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       5  8498077  1699615  508.98  0.000
Residual Error  42   140249     3339
Total           47  8638326

Source  DF   Seq SS
PrRt     1     3746
GDP      1  8260319
Dummy    1   139454
year     1    80187
Pop      1    14371

Unusual Observations
Obs  PrRt       M1     Fit  SE Fit  Residual  St Resid
 23  20.0   436.70  361.08   37.33     75.62     1.71 X
 35   6.0  1129.70  982.60   18.35    147.10     2.68R
 36   7.3  1150.70  986.80   14.01    163.90     2.92R
 37   9.0  1127.40  975.89   11.81    151.51     2.68R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.
So the regression above was my first attempt. There are several questions that can be asked at this point.
1) Why does this regression look awfully good as far as significance and the amount of the variation in the
Y variable that is explained by the equation? (3) Answer: The p-value for the ANOVA is below 1%, which
means that there is a significant relationship between the X values and Y. We have an R-squared of 98.4%;
this means that the regression explains 98.4% of the variation in Y. Also, every p-value is below 5%, which
means that all of the coefficients are significant at the 5% level.
A coefficient βi is insignificant if the null hypothesis H0: βi = 0 cannot be shown to be false
at a reasonable significance level (usually α = .05 or α = .01). In practice, the coefficient is significant if the
t-ratio t = bi/sbi is not between ±tα/2, or if the p-value — 2P(t > tcomputed) or, if the t-ratio is
negative, 2P(t < tcomputed) — is below a reasonable significance level.
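(To illustrate the p-value computation with numbers from Regression 1 — a minimal Python/scipy sketch, with variable names of my own choosing:)

from scipy import stats
t_ratio, df = -4.85, 42                    # PrRt's t-ratio; 42 error degrees of freedom
pvalue = 2 * stats.t.sf(abs(t_ratio), df)  # two-sided p-value, about 0.00002, i.e. significant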
2) There are only two coefficients here whose sign you can predict in advance. What are they, what did you
predict and why and were you right? (2) Answer: The amount of money demanded should fall as the
interest rate rises, so the coefficient of PrRt should be negative. On the other hand, we should expect the
need for money to rise as the GDP rises, so the coefficient of GDP should be positive. These are both as
forecasted in this regression.
3) What does the Analysis of Variance tell us? What hypothesis did it cause you to reject? (1) Answer: The
p-value for the ANOVA is below 1%, which means that there is a significant relationship between the X
values and Y. The null hypothesis for a basic regression ANOVA is that there is no linear relation between
the X variables and Y.
MTB > Regress c2 4 c4 c6 c7 c10;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.
Regression 2
Regression Analysis: M1 versus PrRt, GDP, Dummy, year
The regression equation is
M1 = 321 - 20.7 PrRt + 0.0415 GDP - 174 Dummy + 14.5 year
Predictor     Coef  SE Coef      T      P     VIF
Constant    321.24    66.06   4.86  0.000
PrRt       -20.668    4.016  -5.15  0.000   2.160
GDP        0.04152  0.01055   3.94  0.000  20.791
Dummy      -173.71    40.96  -4.24  0.000   5.444
year        14.530    3.077   4.72  0.000  24.254

S = 59.9651   R-Sq = 98.2%   R-Sq(adj) = 98.0%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       4  8483706  2120927  589.83  0.000
Residual Error  43   154620     3596
Total           47  8638326

Source  DF   Seq SS
PrRt     1     3746
GDP      1  8260319
Dummy    1   139454
year     1    80187

Unusual Observations
Obs  PrRt       M1     Fit  SE Fit  Residual  St Resid
 23  20.0   436.70  371.34   38.39     65.36     1.42 X
 35   6.0  1129.70  982.21   19.04    147.49     2.59R
 36   7.3  1150.70  988.13   14.53    162.57     2.79R
 37   9.0  1127.40  980.00   12.08    147.40     2.51R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.
MTB > Regress c2 3 c4 c6 c7;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   Brief 2.
Regression 3
Regression Analysis: M1 versus PrRt, GDP, Dummy
The regression equation is
M1 = 451 - 14.3 PrRt + 0.0865 GDP - 240 Dummy
Predictor      Coef   SE Coef      T      P    VIF
Constant     450.99     73.19   6.16  0.000
PrRt        -14.269     4.605  -3.10  0.003  1.914
GDP        0.086456  0.005548  15.58  0.000  3.875
Dummy       -239.76     46.90  -5.11  0.000  4.809

S = 73.0515   R-Sq = 97.3%   R-Sq(adj) = 97.1%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       3  8403519  2801173  524.91  0.000
Residual Error  44   234807     5337
Total           47  8638326

Source  DF   Seq SS
PrRt     1     3746
GDP      1  8260319
Dummy    1   139454

Unusual Observations
Obs  PrRt      M1    Fit  SE Fit  Residual  St Resid
 23  20.0   436.7  435.7    43.7       1.0     0.02 X
 35   6.0  1129.7  941.0    20.6     188.7     2.69R
 36   7.3  1150.7  959.0    16.0     191.7     2.69R
 37   9.0  1127.4  962.1    14.0     165.3     2.30R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.
4) What did I do to get from Regression 1 to regression 3 and why? (2) Answer: I removed ‘Year’ and
‘Pop’ because they had the highest VIF. This means that there was a great deal of collinearity present and
that these variables were not really needed. Note that R-squared has barely fallen.
5) Why was I now ready to quit dropping variables and do a ‘best subsets’ regression? (1) [9] Answer: The
VIFs are now all below 5. Of course, the signs of the coefficients and their significance are fine too.
6) What would the money supply be that would be predicted for 1970, assuming that the numbers given for
1970 are correct? By what percent is it off the actual value? (2) Answer: The regression equation is
M1 = 450.99 - 14.269 PrRt + 0.086456 GDP - 239.76 Dummy. The line from our data says

Row   C5     M1  PrRt        GDP  Dummy  GDPPr  GDPdum  year  yearsq
 12  1970  214.4  8.00  $1,038.50     1   8308  1038.5    12     144

So M1 = 450.99 - 14.269(8.00) + 0.086456(1038.50) - 239.76(1) = 450.99 - 114.152
+ 89.785 - 239.76 = 186.86. The observed value was 214.4. The difference between the
two numbers is about 13% of the observed value.
7) Can you make this into a rough prediction interval? Does this include the actual value for 1970? (2) [13]
Answer: The outline says "We can use this (se) to find an approximate prediction interval Y0 = Ŷ0 ± t·se."
The ANOVA says that there are 44 degrees of freedom and that s = 73.0515. We can use t.025(44) = 2.015. The
interval is thus 186.86 ± 2.015(73.0515) = 186.9 ± 147.2. This rather gigantic interval includes the actual
value.
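(The same rough interval in a minimal Python/scipy sketch; the variable names are illustrative:)

from scipy import stats
y_hat, s_e, df = 186.86, 73.0515, 44
t = stats.t.ppf(0.975, df)                   # about 2.015
lo, hi = y_hat - t * s_e, y_hat + t * s_e    # about (39.7, 334.1); includes the actual 214.4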
MTB > BReg c2 c4 c6 c7;
SUBC>   NVars 1 3;
SUBC>   Best 2;
SUBC>   Constant.
Regression 4
Best Subsets Regression: M1 versus PrRt, GDP, Dummy
Response is M1

Vars  R-Sq  R-Sq(adj)  Mallows Cp       S  PrRt  GDP  Dummy
   1  95.6       95.6        26.5  90.432          X
   1  67.8       67.1       477.7  246.02               X
   2  96.7       96.5        11.6  79.727          X    X
   2  95.7       95.5        28.1  91.197    X     X
   3  97.3       97.1         4.0  73.051    X     X    X
8) What is Regression 4 telling me to do? Why can you say that? (2) Answer: The only satisfactory value
of Mallows' Cp (one less than or close to k + 1) is 4.0, which belongs to the regression with all three
independent variables.
MTB > Regress c2 3 c4 c6 c7;
SUBC>   GFourpack;
SUBC>   RType 1;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   DW;
SUBC>   Brief 2.
Regression 5
Regression Analysis: M1 versus PrRt, GDP, Dummy
The regression equation is
M1 = 451 - 14.3 PrRt + 0.0865 GDP - 240 Dummy
Predictor      Coef   SE Coef      T      P    VIF
Constant     450.99     73.19   6.16  0.000
PrRt        -14.269     4.605  -3.10  0.003  1.914
GDP        0.086456  0.005548  15.58  0.000  3.875
Dummy       -239.76     46.90  -5.11  0.000  4.809

S = 73.0515   R-Sq = 97.3%   R-Sq(adj) = 97.1%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       3  8403519  2801173  524.91  0.000
Residual Error  44   234807     5337
Total           47  8638326

Source  DF   Seq SS
PrRt     1     3746
GDP      1  8260319
Dummy    1   139454

Unusual Observations
Obs  PrRt      M1    Fit  SE Fit  Residual  St Resid
 23  20.0   436.7  435.7    43.7       1.0     0.02 X
 35   6.0  1129.7  941.0    20.6     188.7     2.69R
 36   7.3  1150.7  959.0    16.0     191.7     2.69R
 37   9.0  1127.4  962.1    14.0     165.3     2.30R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.
Durbin-Watson statistic = 0.445619
Residual Plots for M1 [graphs not shown]
9) Regression 5 is just a repeat of Regression 3, but now I am doing residual analysis. What are the Durbin-Watson
statistic and the plot of residuals versus order telling me is present? What 2 conditions for regression
seem to be violated? (3) [18] Answer: The low value of the Durbin-Watson statistic and the graphs
seem to show a lot of serial correlation; the errors seem to come in waves. Furthermore, the graphs seem to
indicate that the errors are getting bigger as time passes (and GDP grows). The basic theorem for
regression says that there should be no serial correlation and that the errors should not correlate with the
independent variables.
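(For reference, the Durbin-Watson statistic is just a ratio of sums of squares of the residuals. A minimal Python/numpy sketch, assuming resid holds the residuals in time order; the function is mine, not Minitab's:)

import numpy as np

def durbin_watson(resid):
    """Sum of squared successive differences over the residual sum of squares.
    Near 2 suggests no first-order serial correlation; near 0 (as here, 0.446)
    suggests strong positive autocorrelation."""
    e = np.asarray(resid, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)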
MTB > Regress c2 4 c4 c6 c7 c13;
SUBC>   GFourpack;
SUBC>   RType 1;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   DW;
SUBC>   Brief 2.
Regression 6
Regression Analysis: M1 versus PrRt, GDP, Dummy, GDPsq
The regression equation is
M1 = 131 - 13.1 PrRt + 0.187 GDP - 26.3 Dummy - 0.000007 GDPsq
Predictor         Coef     SE Coef      T      P     VIF
Constant        131.36       64.18   2.05  0.047
PrRt           -13.142       3.050  -4.31  0.000   1.919
GDP            0.18659     0.01370  13.62  0.000  53.994
Dummy           -26.33       41.88  -0.63  0.533   8.764
GDPsq      -0.00000671  0.00000088  -7.59  0.000  33.120

S = 48.3231   R-Sq = 98.8%   R-Sq(adj) = 98.7%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       4  8537916  2134479  914.07  0.000
Residual Error  43   100410     2335
Total           47  8638326

Source  DF   Seq SS
PrRt     1     3746
GDP      1  8260319
Dummy    1   139454
GDPsq    1   134396

Unusual Observations
Obs  PrRt       M1      Fit  SE Fit  Residual  St Resid
 23  20.0   436.70   386.21   29.65     50.49     1.32 X
 35   6.0  1129.70   997.46   15.53    132.24     2.89R
 36   7.3  1150.70  1020.24   13.32    130.46     2.81R
 37   9.0  1127.40  1026.39   12.53    101.01     2.16R
 42   9.5  1087.70  1191.94   14.93   -104.24    -2.27R
 48   8.0  1365.90  1320.38   30.91     45.52     1.23 X
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.
Durbin-Watson statistic = 0.551845
Residual Plots for M1 [graphs not shown]
10) I now felt free to add the square of GDP as a new independent variable. What happened to the VIFs?
Do I care? Why? (2) Answer: The VIFs went up, but they only show that there is a close relationship
between GDP and its square. Since these are, in a sense, the same variable, I don't care.
11) What did adding the square of GDP do to the significance of my coefficients and the fraction of the
variation of Y that is explained by the equation? (2) [22] Answer: This is the real problem here. The
p-value for the coefficient of the dummy variable has gone through the roof, indicating that it is no longer
significant. Both R-squared and R-squared adjusted have risen, so maybe we are doing something right.
MTB > let c14 = loge (c2)
MTB > Regress c14 4 c4 c6 c7 c13;
SUBC>   GFourpack;
SUBC>   RType 1;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   DW;
SUBC>   Brief 2.
Regression 7
Regression Analysis: log M1 versus PrRt, GDP, Dummy, GDPsq
The regression equation is
log M1 = 4.79 + 0.00846 PrRt + 0.000453 GDP + 0.0289 Dummy - 0.000000 GDPsq
Predictor         Coef     SE Coef      T      P     VIF
Constant        4.7882      0.1358  35.26  0.000
PrRt          0.008461    0.006453   1.31  0.197   1.919
GDP         0.00045309  0.00002899  15.63  0.000  53.994
Dummy          0.02889     0.08862   0.33  0.746   8.764
GDPsq      -0.00000002  0.00000000 -11.66  0.000  33.120

S = 0.102246   R-Sq = 98.5%   R-Sq(adj) = 98.4%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       4  29.3981  7.3495  703.01  0.000
Residual Error  43   0.4495  0.0105
Total           47  29.8476

Source  DF   Seq SS
PrRt     1   1.2680
GDP      1  25.6375
Dummy    1   1.0725
GDPsq    1   1.4202

Unusual Observations
Obs  PrRt  log M1     Fit  SE Fit  Residual  St Resid
 23  20.0  6.0792  6.1618  0.0627   -0.0826    -1.02 X
 42   9.5  6.9918  7.2158  0.0316   -0.2239    -2.30R
 48   8.0  7.2196  7.0393  0.0654    0.1803     2.29RX
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.
Durbin-Watson statistic = 0.306367
Residual Plots for log M1 [graphs not shown]
12) I just replaced the money supply by its logarithm. The residual analysis tells me this was a sort of good
idea. What does that mean? (1) [23] Answer: The last diagram seems to show a somewhat
smaller tendency of the error to grow with time.
13) What is really weird about these coefficients? Which one has the wrong sign? (1) Answer: The
coefficients are awfully small. This probably could have been fixed for GDP and GDP squared by better
technique — dividing GDP by 10³ (and GDP squared by 10⁶) before we started. But
the coefficient of the interest rate now has an unexpected positive sign, and it is also insignificant. The
dummy variable also has an enormous p-value.
MTB > Regress c14 3 c4 c6 c13;
SUBC>   GFourpack;
SUBC>   RType 1;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   DW;
SUBC>   Brief 2.
Regression 8
Regression Analysis: log M1 versus PrRt, GDP, GDPsq
The regression equation is
log M1 = 4.83 + 0.00732 PrRt + 0.000445 GDP - 0.000000 GDPsq
Predictor         Coef     SE Coef       T      P     VIF
Constant       4.83016     0.04310  112.06  0.000
PrRt          0.007316    0.005359    1.37  0.179   1.351
GDP         0.00044536  0.00001650   26.99  0.000  17.854
GDPsq      -0.00000002  0.00000000  -15.60  0.000  18.176

S = 0.101203   R-Sq = 98.5%   R-Sq(adj) = 98.4%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       3  29.3970  9.7990  956.75  0.000
Residual Error  44   0.4506  0.0102
Total           47  29.8476

Source  DF   Seq SS
PrRt     1   1.2680
GDP      1  25.6375
GDPsq    1   2.4915

Unusual Observations
Obs  PrRt  log M1     Fit  SE Fit  Residual  St Resid
 23  20.0  6.0792  6.1606  0.0620   -0.0814    -1.02 X
 42   9.5  6.9918  7.2104  0.0267   -0.2186    -2.24R
 48   8.0  7.2196  7.0413  0.0644    0.1783     2.28RX
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.
Durbin-Watson statistic = 0.289829
Residual Plots for log M1 [graphs not shown]
MTB > Regress c14 2 c6 c13;
SUBC>   GFourpack;
SUBC>   RType 1;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   DW;
SUBC>   Brief 2.
Regression 9
Regression Analysis: log M1 versus GDP, GDPsq
The regression equation is
log M1 = 4.87 + 0.000457 GDP - 0.000000 GDPsq
Predictor         Coef     SE Coef       T      P     VIF
Constant       4.87027     0.03184  152.96  0.000
GDP         0.00045654  0.00001446   31.58  0.000  13.455
GDPsq      -0.00000002  0.00000000  -18.76  0.000  13.455

S = 0.102169   R-Sq = 98.4%   R-Sq(adj) = 98.4%

Analysis of Variance
Source          DF      SS      MS        F      P
Regression       2  29.378  14.689  1407.18  0.000
Residual Error  45   0.470   0.010
Total           47  29.848

Source  DF  Seq SS
GDP      1  25.705
GDPsq    1   3.673

Unusual Observations
Obs    GDP  log M1     Fit  SE Fit  Residual  St Resid
 42   9817  6.9918  7.1988  0.0256   -0.2070    -2.09R
 47  12434  7.2249  7.0925  0.0478    0.1324     1.47 X
 48  13195  7.2196  7.0041  0.0590    0.2154     2.58RX
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.
Durbin-Watson statistic = 0.208342
Residual Plots for log M1 [graphs not shown]
MTB > Regress c14 3 c6 c13 c8;
SUBC>   GFourpack;
SUBC>   RType 1;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   DW;
SUBC>   Brief 2.
Regression 10
Regression Analysis: log M1 versus GDP, GDPsq, GDPPr
The regression equation is
log M1 = 4.87 + 0.000465 GDP - 0.000000 GDPsq - 0.000001 GDPPr
Predictor         Coef     SE Coef       T      P     VIF
Constant       4.86787     0.03240  150.23  0.000
GDP         0.00046548  0.00002208   21.08  0.000  30.892
GDPsq      -0.00000002  0.00000000  -16.38  0.000  17.958
GDPPr      -0.00000070  0.00000130   -0.54  0.593   5.889

S = 0.102985   R-Sq = 98.4%   R-Sq(adj) = 98.3%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       3  29.3810  9.7937  923.42  0.000
Residual Error  44   0.4667  0.0106
Total           47  29.8476

Source  DF   Seq SS
GDP      1  25.7052
GDPsq    1   3.6727
GDPPr    1   0.0031

Unusual Observations
Obs    GDP  log M1     Fit  SE Fit  Residual  St Resid
 42   9817  6.9918  7.1826  0.0396   -0.1908    -2.01R
 48  13195  7.2196  6.9803  0.0741    0.2393     3.35RX
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.
Durbin-Watson statistic = 0.196041
Residual Plots for log M1
Not Shown.
14) What has happened to significance and the fraction of the variation in the dependent variable explained
by the regression in Regressions 8, 9 and 10? In terms of significance etc., which of these three is the 'best'
regression? Why would the Chairman of the FRB be very annoyed? (3) [27] Answer: The values of
R-squared and R-squared adjusted barely changed as we dropped the dummy variable and the interest rate.
Trying to put the interest rate back in as an interaction variable did no good. Regression 9 has no
insignificant coefficients, but since the dropped independent variables are 'monetary' variables, the
equation seems to say that the FRB has no influence on the money supply.
MTB > Regress c14 4 c6 c13 c8 c15;
SUBC>   GFourpack;
SUBC>   RType 1;
SUBC>   Constant;
SUBC>   VIF;
SUBC>   DW;
SUBC>   Brief 2.
Regression 11
Regression Analysis: log M1 versus GDP, GDPsq, GDPPr, logM1l
The regression equation is
log M1 = - 0.174 + 0.000001 GDP - 0.000000 GDPsq - 0.000001 GDPPr + 1.04 logM1l
Predictor         Coef     SE Coef      T      P      VIF
Constant       -0.1738      0.2820  -0.62  0.541
GDP         0.00000085  0.00002708   0.03  0.975  383.407
GDPsq      -0.00000000  0.00000000  -0.36  0.723  136.981
GDPPr      -0.00000109  0.00000045  -2.39  0.021    5.902
logM1l         1.04474     0.05838  17.89  0.000   80.236

S = 0.0358443   R-Sq = 99.8%   R-Sq(adj) = 99.8%

Analysis of Variance
Source          DF       SS      MS        F      P
Regression       4  29.7924  7.4481  5797.02  0.000
Residual Error  43   0.0552  0.0013
Total           47  29.8476

Source  DF   Seq SS
GDP      1  25.7052
GDPsq    1   3.6727
GDPPr    1   0.0031
logM1l   1   0.4114

Unusual Observations
Obs    GDP   log M1      Fit   SE Fit  Residual  St Resid
 28   4463  6.58576  6.49641  0.00824   0.08935     2.56R
 37   7398  7.02767  7.09767  0.00984  -0.07000    -2.03R
 38   7817  6.98601  7.07589  0.00849  -0.08988    -2.58R
 48  13195  7.21957  7.18793  0.02830   0.03164     1.44 X
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.
Durbin-Watson statistic = 1.17315
Residual Plots for log M1
Not displayed.
15) So what problem did this fix? Incidentally, what I added to the independent variables was the money
supply of the previous period. (1) [28] Answer: This represents a half-hearted attempt to fix the
autocorrelation by using the previous year's money supply as an independent variable. It raised the D-W
statistic closer to the desired 2, but, if we work with the D-W table in the text, not enough so that we can say
that there is no autocorrelation.
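(Outside Minitab, the lagged variable is just the series shifted one period. A minimal Python/numpy sketch; only the first five M1 values are shown to keep it short:)

import numpy as np
m1 = np.array([140.0, 140.7, 145.2, 147.8, 153.3])  # 1959-1963; the worksheet has 48 values
log_m1 = np.log(m1)                   # c14, log M1
log_m1_lag = np.empty_like(log_m1)    # c15, logM1l = previous year's log M1
log_m1_lag[0] = 4.89222               # log of the 1958 M1, taken from the worksheet
log_m1_lag[1:] = log_m1[:-1]          # e.g. log_m1_lag[1] = log(140.0) = 4.94164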
II. Do at least 4 of the following 8 problems (at least 12 points each) (or do sections adding to at least 50 points –
anything extra you do helps, and grades wrap around). It is especially important to do more if you have
skipped much of Parts I or II. Show your work! State H0 and H1 where applicable. Use a significance
level of 5% unless noted otherwise. Do not answer questions without citing appropriate statistical tests
– that is, explain your hypotheses and what values from what table were used to test them. Clearly
label what section of each problem you are doing! The entire test has about 160 points, but 70 is
considered a perfect score. Don't waste our time by telling me that two means, proportions, variances
or medians don't look the same to you. You need statistical tests! There are some blank pages below.
Put your name on as many loose pages as possible! Mark sections of your answer clearly.
1). Multiple choice.
a) If I want to test to see if the mean of x1 is larger than the mean of x2, my null hypothesis is:
(Note: D = μ1 − μ2) Only check one answer! (2)
[Options i) through viii) each paired a relation between μ1 and μ2 (=, ≠, <, >, ≤ or ≥) with the
corresponding relation between D and 0; the relation symbols did not survive reproduction.]
No answer is provided because this question will be repeated on future exams.
b) Compared to multiple regression, simple regression is different in having only one
i) Observation
ii) Parameter
iii) Dependent variable
iv) *Independent variable
v) Y-intercept
vi) All of the above
c) For the following quantities, mark their lines with yes (Y) or no (N) as to whether they must be positive.
___ R² adjusted for degrees of freedom
___ The correlation rx1x2 between two independent variables x1 and x2
___ Sxy = Σxy − nx̄ȳ
___ The coefficient b0 in a multiple regression.
Solution: Though there are many quantities in a multiple regression that must be positive, none of the
above is in that category.
d) Assume that we wish to test the hypothesis that a mean is greater than 3 and we compute the ratio
t = (x̄ − 3)/sx̄, where our sample statistics are computed from a sample of 29. If α = .05, we reject the null
hypothesis if
i) t is above 1.645 or below -1.645
ii) t is above 1.960 or below -1.960
iii) t is below -1.645
iv) t is below -1.960
v) t is above 1.645
vi) t is above 1.960
vii) *None of the above. (Fill in a more appropriate answer!)
Explanation: This is for all you out there who don't believe in the t table. We have a right-sided test and
df = n − 1 = 28. Reject the null hypothesis H0: μ ≤ 3 if t > t.05(28) = 1.701.
e) Consumers are asked to take the Pepsi Challenge. They were asked which cola they preferred, and the
number that preferred Pepsi was recorded. Sample 1 was males and sample 2 was females. The
following was run on Minitab.
MTB > PTwo 109 46 52 13;
SUBC>   Pooled.
Test and CI for Two Proportions
Sample   X    N  Sample p
1       46  109  0.422018
2       13   52  0.250000
Difference = p (1) - p (2)
Estimate for difference: 0.172018
95% CI for difference: (0.0221925, 0.321844)
Test for difference = 0 (vs not = 0): Z = 2.12  P-Value = 0.034
On the basis of the printout above we can say one of the following.
i) At a 99% confidence level we can say that we have enough evidence to state that the proportion
of men that prefer Pepsi differs from the proportion of women that prefer Pepsi
ii) *At a 95% confidence level we can say that we have enough evidence to state that the
proportion of men that prefer Pepsi differs from the proportion of women that prefer Pepsi
iii) At a 99% confidence level we can say that we have enough evidence to state that the proportion
of men that prefer Pepsi equals the proportion of women that prefer Pepsi.
iv) At a 96% confidence level there is insufficient evidence to indicate that the proportion of men
that prefer Pepsi differs from the proportion of women that prefer Pepsi
Explanation: This is a two-sided test. The null hypothesis is H0: p1 = p2. Because of the p-value of
3.4%, we reject the null hypothesis if α = .05 but not if α = .01. On the other hand, a null hypothesis is
never 'proved.'
f) A researcher is comparing room temperatures preferred by random samples of 135 adults and 80
children. The Minitab output follows.
MTB > TwoT 135 77.5 4.5 80 76.5 2.5;
SUBC>   Alternative 1.
Two-Sample T-Test and CI
Sample    N   Mean  StDev  SE Mean
1       135  77.50   4.50     0.39
2        80  76.50   2.50     0.28
Difference = mu (1) - mu (2)
Estimate for difference: 1.000
95% lower bound for difference: 0.211
T-Test of difference = 0 (vs >): T-Value = 2.09  P-Value = 0.019  DF = 212
On the basis of what you see here and the way we have stated null-alternate hypothesis pairs in class, we
come to the following conclusion if we use a 99% confidence level.
i) Do not reject H0: μ1 = μ2
ii) Do not reject H0: μ1 ≥ μ2
iii) *Do not reject H0: μ1 ≤ μ2
iv) Reject H0: μ1 = μ2
v) Reject H0: μ1 ≥ μ2
vi) Reject H0: μ1 ≤ μ2
vii) None of the above (Fill in a more appropriate answer!)
Explanation: This is a one-sided test. The alternate hypothesis is H1: μ1 > μ2, so the opposite null
hypothesis is H0: μ1 ≤ μ2. Because of the p-value of 1.9%, we do not reject the null hypothesis if α = .01.
2) The data below represent the sales of Friendly Autos for 7 randomly selected months. They believe that
the number of cars sold depends on the average price for that month (in $ thousands), the number of
advertising spots that appeared on the local TV station and whether other types of advertising were used in
that month (a dummy variable that is 1 if other types of advertising were used in a given month).

Row  Sold  Price  Adv  Type  Product
  1    10   28.2   10     1       10
  2     8   28.7    6     1        6
  3    12   27.9   14     1       14
  4    13   27.8   18     0        0
  5     9   28.1   10     0        0
  6    14   28.8   19     1       19
  7    15   28.9   20     1       20
                   (Sum of Product = 69)

Sum of Sold = 81, Sum of Price = 198.4, Sum of Adv = 97, Sum of Sold squared = 979, Sum of Price
squared = 5624.44, Sum of Adv squared = 1517, Sum of Sold * Price = 2297.4, Sum of Sold * Adv = 1206,
Sum of Price * Adv = 2751.4.
a) If advertising (Adv) is x5 (it isn't) and Type is x6, compute Σx5x6. (2)
b) Compute the coefficients of the equation Ŷ = b0 + b1x to predict the value of 'Sold' on the basis of 'Price.' (5)
c) Compute R² and R² adjusted for degrees of freedom. (4)
d) Compute the standard error se. (3)
e) Is the slope of the simple regression significant at the 1% level? Do not answer this question without appropriate calculations! (3) [17]
f) Is the sign of the coefficient of Price what you expected? Why or why not? (1)
g) Predict the average number of cars that will be sold when the price is $30 thousand using the equation you got and make it into an
appropriate interval. (4)
h) Do a 1% confidence interval for β0, the y-intercept. (3) [24, 36]
Solution: The quantities below are given: n = 7, Σy = 81, Σx1 = 198.4, Σx2 = 97, Σy² = 979,
Σx1² = 5624.44, Σx2² = 1517, Σx1y = 2297.4, Σx2y = 1206 and Σx1x2 = 2751.4. You do not need all of
these for this problem. Please note that if you told me that Σx1² = (198.4)² or Σx1y = (198.4)(81)
or anything similar, it was instant death!
Spare Parts Computation:
Ȳ = 81/7 = 11.57143, X̄1 = 198.4/7 = 28.34286, X̄2 = 97/7 = 13.85714
ΣX1² − nX̄1² = SSX1 = 5624.44 − 7(28.34286)² = 1.21601*
ΣX2² − nX̄2² = SSX2 = 1517 − 7(13.85714)² = 172.8577*†
ΣY² − nȲ² = SST = SSY = 979 − 7(11.57143)² = 41.7141*
ΣX1Y − nX̄1Ȳ = SX1Y = 2297.4 − 7(28.34286)(11.57143) = 1.62806
ΣX2Y − nX̄2Ȳ = SX2Y = 1206 − 7(13.85714)(11.57143) = 83.5715†
ΣX1X2 − nX̄1X̄2 = SX1X2 = 2751.4 − 7(28.34286)(13.85714) = 2.14315†
*Must be positive. The rest may well be negative. †Needed only in the next problem.
a) If advertising (Adv) is x5 (it isn't) and Type is x6, Σx5x6 = 69; see the column labeled 'Product' above. (2)
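(The spare parts can be verified in a few lines of Python/numpy — a sketch, whose rounding differs slightly from the hand computation above:)

import numpy as np
sold  = np.array([10, 8, 12, 13, 9, 14, 15])
price = np.array([28.2, 28.7, 27.9, 27.8, 28.1, 28.8, 28.9])
n = len(sold)
ssx = (price ** 2).sum() - n * price.mean() ** 2              # SSX1, about 1.22
sxy = (price * sold).sum() - n * price.mean() * sold.mean()   # SX1Y, about 1.63
sst = (sold ** 2).sum() - n * sold.mean() ** 2                # SSY, about 41.71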
b) Compute the coefficients of the equation Ŷ = b0 + b1x to predict the value of 'Sold' on the basis of
'Price.' (5) The coefficients are b1 = Sxy/SSx = (ΣXY − nX̄Ȳ)/(ΣX² − nX̄²) = 1.62806/1.21601 = 1.33885 and
b0 = Ȳ − b1X̄ = 11.57143 − 1.33885(28.34286) = −26.3754. So Ŷ = −26.3754 + 1.3389X.
c) Compute R² and R² adjusted for degrees of freedom. (4) We have found SX1Y = 1.62806, SSX1 =
1.21601, b1 = 1.33885 and SST = SSY = 41.7141. This means that the regression (explained) sum of
squares is SSR = b1Sxy = 1.33885(1.62806) = 2.17973, so R² = SSR/SST = 2.17973/41.7141 = .05225, or
R² = (Sxy)²/(SSx·SSy) = (1.62806)²/[1.21601(41.7141)] = .05225. These are terribly low, but things get worse
when we compute R² adjusted for degrees of freedom: R̄² = [(n − 1)R² − k]/(n − k − 1) = [6(.05225) − 1]/5
= −0.1373. Well, I did tell you that it could be negative, but this is ridiculous.
d) Compute the standard error se. (3)
se² = SSE/(n − 2) = (SST − SSR)/(n − 2) = (SSy − b1Sxy)/(n − 2) = (41.7141 − 2.1797)/5 = 7.90688, and
se = √7.90688 = 2.8119.
e) Is the slope of the simple regression significant at the 1% level? Do not answer this question without
appropriate calculations! (3)
s²b1 = se²(1/SSx) = 7.90688/1.21601 = 6.50232, so sb1 = 2.5500. To test for significance, H0: β1 = 0, we
compute t = (b1 − 0)/sb1 = 1.33885/2.5500 = 0.525. Since α = .01, t.005(5) = 4.032. The 'do not reject' zone
is between ±4.032. Since our computed t lies between these values, we cannot reject the null hypothesis and
we must declare b1 insignificant.
f) Is the sign of the coefficient of Price what you expected? Why or why not? (1) If this equation represents
demand for cars, we would expect quantity demanded to rise as price falls, so the coefficient of price should
be negative. This is not happening in this equation. Note that, since we have already shown that the
coefficient is not significant, a confidence interval will include both positive and negative values.
g) Predict the average number of cars that will be sold when the price is $30 thousand using the equation
you got and make it into an appropriate interval. (4) The confidence interval is μY0 = Ŷ0 ± t·sŶ, where
s²Ŷ = se²[1/n + (X0 − X̄)²/SSx], se² = 7.90688, X̄1 = 28.34286, SSX1 = 1.21601 and X0 = 30.
So Ŷ0 = −26.3754 + 1.3389X0 = −26.3754 + 1.3389(30) = 13.79 and
s²Ŷ = 7.90688[1/7 + (30 − 28.34286)²/1.21601] = 7.90688(0.14286 + 2.25831) = 18.986, so sŶ = 4.3573.
If α = .01, t.005(5) = 4.032 and if α = .05, t.025(5) = 2.571. The 1% interval is
13.79 ± 4.032(4.3573) = 13.79 ± 17.57, which is amazingly vague.
h) Do a 1% confidence interval for β0, the y-intercept. (3) b0 = −26.3754 and
s²b0 = se²[1/n + X̄²/SSx] = 7.90688[1/7 + (28.34286)²/1.21601] = 5224.4, which means sb0 = 72.28, so
β0 = −26.3754 ± 4.032(72.28) = −26.38 ± 291.43, which strongly indicates that the intercept is not significant.
3) The data below represent the sales of Friendly Autos for 7 randomly selected months. They believe that
the number of cars sold depends on the average price for that month (in $ thousands), the number of
advertising spots that appeared on the local TV station and whether other types of advertising were used in
that month (a dummy variable that is 1 if other types of advertising were used in a given month).

Row  Sold  Price  Adv  Type
  1    10   28.2   10     1
  2     8   28.7    6     1
  3    12   27.9   14     1
  4    13   27.8   18     0
  5     9   28.1   10     0
  6    14   28.8   19     1
  7    15   28.9   20     1

Sum of Sold = 81, Sum of Price = 198.4, Sum of Adv = 97, Sum of Sold squared = 979, Sum of Price squared = 5624.44, Sum of
Adv squared = 1517, Sum of Sold * Price = 2297.4, Sum of Sold * Adv = 1206, Sum of Price * Adv = 2751.4.
a) Do a multiple regression of 'Sold' against 'Price' and 'Advertising.' Attempts to recycle b1 from the previous page or to compute
b2 by using a simple regression formula won't work and won't get any credit. (12)
b) Compute R² and R² adjusted for degrees of freedom. (3)
c) i) Do an ANOVA for the simple regression using either your regression sum of squares or R². (2)
ii) Do a similar ANOVA for the multiple regression. (2) iii) Combine the two ANOVAs to do an F test to see if the addition of 'Adv'
was worthwhile. (2) [21]
d) Predict the average number of cars that will be sold when the price is $30 thousand and there are 15 spots using the equation you
got and make it into an appropriate interval. (3) [24, 60]
Solution: The Spare Parts Computation is repeated. Serious errors occurred because of excess
rounding.
Ȳ = 81/7 = 11.57143, X̄1 = 198.4/7 = 28.34286, X̄2 = 97/7 = 13.85714
ΣX1² − nX̄1² = SSX1 = 5624.44 − 7(28.34286)² = 1.21601*
ΣX2² − nX̄2² = SSX2 = 1517 − 7(13.85714)² = 172.8577*
ΣY² − nȲ² = SST = SSY = 979 − 7(11.57143)² = 41.7141*
ΣX1Y − nX̄1Ȳ = SX1Y = 2297.4 − 7(28.34286)(11.57143) = 1.62806
ΣX2Y − nX̄2Ȳ = SX2Y = 1206 − 7(13.85714)(11.57143) = 83.5715
ΣX1X2 − nX̄1X̄2 = SX1X2 = 2751.4 − 7(28.34286)(13.85714) = 2.14315
*Must be positive. The rest may well be negative.
a) Do a multiple regression of 'Sold' against 'Price' and 'Advertising.' Attempts to recycle b1 from the
previous page or to compute b2 by using a simple regression formula won't work and won't get any credit.
(12)
We substitute our spare parts into the Simplified Normal Equations:
ΣX1Y − nX̄1Ȳ = b1(ΣX1² − nX̄1²) + b2(ΣX1X2 − nX̄1X̄2)
ΣX2Y − nX̄2Ȳ = b1(ΣX1X2 − nX̄1X̄2) + b2(ΣX2² − nX̄2²),
which are
1.62806 = 1.21601 b1 + 2.14315 b2
83.5715 = 2.14315 b1 + 172.8577 b2
and solve them as two equations in two unknowns for b1 and b2. These are a fairly tough pair of equations
to solve until we notice that, if we multiply 2.14315 by 172.8577/2.14315 = 80.6559, we get 172.8577. If we
multiply the first equation by 80.6559, the equations become
131.31313 = 98.07758 b1 + 172.8577 b2
 83.5715  =  2.14315 b1 + 172.8577 b2
If we subtract these, we get 47.7416 = 95.9344 b1. This means that b1 = 47.7416/95.9344 = 0.4976. The first of our
equations was 1.62806 = 1.21601 b1 + 2.14315 b2. This can be rewritten as
2.14315 b2 = 1.62806 − 1.21601 b1. We may as well divide through by 2.14315 and then substitute in our
value for b1 to get b2 = 0.759658 − 0.567394 b1 = 0.759658 − 0.567394(0.4976) = 0.4773.
(It's worth checking your work by substituting your values of b1 and b2 back into the normal equations.)
Finally we get b0 by using Ȳ = 11.57143, X̄1 = 28.34286 and X̄2 = 13.85714 in b0 = Ȳ − b1X̄1 − b2X̄2
= 11.57143 − 0.4976(28.34286) − 0.4773(13.85714) = 11.57143 − 14.10341 − 6.614301 = −9.1460. Thus
our equation is Ŷ = b0 + b1X1 + b2X2 = −9.1460 + 0.4976X1 + 0.4773X2.
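(The same pair of normal equations can be solved mechanically — a minimal Python/numpy sketch:)

import numpy as np
A = np.array([[1.21601, 2.14315],      # coefficients of b1 and b2 in the two equations
              [2.14315, 172.8577]])
rhs = np.array([1.62806, 83.5715])
b1, b2 = np.linalg.solve(A, rhs)       # about 0.4977 and 0.4773, as above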
b) Compute R² and R² adjusted for degrees of freedom. (3) Remember that we had R²Y.1 = .05225 and
R̄²Y.1 = −0.1373 from the first regression. Our two spare parts indicating interaction between Y and the
independent variables were ΣX1Y − nX̄1Ȳ = SX1Y = 1.62806 and ΣX2Y − nX̄2Ȳ = SX2Y = 83.5715.
The formula for the regression sum of squares is SSR = b1·Sx1y + b2·Sx2y
= 0.4976(1.62806) + 0.4773(83.5715) = 0.8101 + 39.8887 = 40.6988. Our total sum of squares was
ΣY² − nȲ² = SST = SSY = 41.7141, so we can say R²Y.12 = SSR/SST = 40.6988/41.7141 = .9757 and
R̄²Y.12 = [(n − 1)R² − k]/(n − k − 1) = [6(.9757) − 2]/4 = .9636. This, of course, represents a fabulous
improvement over our first regression!
4
c) i) Do an ANOVA for the simple regression using either your regression sum of squares or R 2 (2). Let’s
Y 2  nY 2  SST  SSY = 41.7141 and, for the
collect our stuff again. For both regressions we had

first regression we had SSR  b1 S xy  2.17973 , SSE  SST  SSR  41.7141  2.17973  39.5344 and
SSR
 .005225 . So our ANOVA table will be as below. Of course, there is no need to do both the
SST
‘SS’ and ‘R-squared’ table.
Source
SS
DF
MS
F
F


1
,
5
Regression
2.1797
1
2.1797
0.276
F
 6.61 or F 1,5  16.26
RY2.1 
.01
.05
Error
39.5344
5
7.90688
Total
41.7141
6
If we recall R 2  .2405 for this regression, we can rewrite the table as below.
Source
DF
‘MS’
F
R2
Regression
0.05225
1
0.05225
0.276
F
1,6   5.99 or F 1,5  16 .26
F.05
.01
Error
0.94775
5
0.18955
Total
1.00000
6
Note that, since the computed F is far below the table values, we cannot reject the null hypothesis
of no relationship between the dependent and independent variables.
ii) Do a similar ANOVA for the multiple regression. (2) SSR = b1Sx1y + b2Sx2y = 40.6988,
SSE = SST − SSR = 41.7141 − 40.6988 = 1.0153.

Source        SS      DF  MS       F       F table
Regression   40.6988   2  20.3494  80.169  F.05(2,4) = 6.94 or F.01(2,4) = 18.00
Error         1.0153   4  0.25383
Total        41.7141   6

If we recall R²Y.12 = .9757 for this regression, we can rewrite the table as below.

Source        R²     DF  'MS'     F       F table
Regression   0.9757   2  0.48785  80.238  F.05(2,4) = 6.94 or F.01(2,4) = 18.00
Error        0.0243   4  0.00608
Total        1.0000   6

Note that, since the computed F is far above the table values, we can reject the null hypothesis of
no relationship between the dependent and independent variables.
iii) Combine the two ANOVAs to do an F test to see if the addition of 'Adv' was worthwhile. (2) [21]
Use the SS or R-squared in the first table to get a subtotal in the second table.

Source        SS      DF  MS       F       F table
Regression   40.6988   2
  X1          2.1797   1
  X2         38.5191   1  38.5191  151.75  F.05(1,4) = 7.71 or F.01(1,4) = 21.20
Error         1.0153   4  0.25383
Total        41.7141   6

Source        R²      DF  'MS'     F       F table
Regression   0.9757    2
  X1         0.05225   1
  X2         0.92345   1  0.92345  151.88  F.05(1,4) = 7.71 or F.01(1,4) = 21.20
Error        0.0243    4  0.00608
Total        1.0000    6

Note that, since the computed F is far above the table values, we can reject the null hypothesis of
no additional explanatory power from the added independent variable.
d) Predict the average number of cars that will be sold when the price is $30 thousand and there are 15 spots
using the equation you got and make it into an appropriate interval. (3) [24, 60]
Note that in the ANOVA for the second regression the error mean square is 0.25383. Its square
root, 0.5038, is the standard error. The outline tells us that μY0 = Ŷ0 ± t·se/√n. Here
Ŷ0 = −9.1460 + 0.4976X1,0 + 0.4773X2,0 = −9.1460 + 0.4976(30) + 0.4773(15) = 12.9415,
se/√n = √(0.25383/7) = 0.19042, and we use t.025(4) = 2.776 or t.005(4) = 4.604. Thus the interval could be
12.94 ± 0.53.
4) The data below represent the sales of Friendly Autos for 7 randomly selected months. They believe that
the number of cars sold depends on the average price for that month (in $ thousands), the number of
advertising spots that appeared on the local TV station and whether other types of advertising were used in
that month (a dummy variable that is 1 if other types of advertising were used in a given month).

Row  Sold  Price  Adv  Type
  1    10   28.2   10     1
  2     8   28.7    6     1
  3    12   27.9   14     1
  4    13   27.8   18     0
  5     9   28.1   10     0
  6    14   28.8   19     1
  7    15   28.9   20     1
The Minitab output below gives the full regression of ‘Sold’ against all three independent variables.
Regression Analysis: Sold versus Adv, Price, Type
The regression equation is
Sold = 8.46 + 0.487 Adv - 0.153 Price + 0.982 Type
Predictor     Coef  SE Coef      T      P
Constant     8.457    6.990   1.21  0.313
Adv        0.48699  0.01696  28.72  0.000
Price      -0.1530  ………      ………    0.586
Type        0.9815   0.2297   4.27  0.024

S = 0.218501   R-Sq = 99.7%   R-Sq(adj) = 99.3%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       3  41.571  13.857  290.24  0.000
Residual Error   3   0.143   0.048
Total            6  41.714

Source  DF  Seq SS
Adv      1  40.404
Price    1   0.295
Type     1   0.872
2
a) Using the material in this output find the value of R for a regression against ‘Adv’ alone. (2)
b) Look at the line that represents the coefficient of ‘Price.’ What about the coefficient makes me happy? What about the coefficient
makes me sad? (2)
c) Find the partial correlation of ‘Type’ with ‘Sold.’ (2)
d) Since you now have enough information to do it, use an F test the see whether the addition of the two advertising independent
variables as a pair was worthwhile. (4) [10]
Solution: a) Using the material in this output, find the value of R² for a regression against 'Adv' alone. (2)
The sequential sum of squares gives us, for 'Adv', a sum of squares of 40.404. If we divide that by the total
sum of squares we get R²Y.2 = 40.404/41.714 = .96869.
b) Look at the line that represents the coefficient of 'Price.' What about the coefficient makes me happy?
What about the coefficient makes me sad? (2) The coefficient of 'Price' is negative, as it should be.
However, with a p-value of .586 it is still not significant.
c) Find the partial correlation of 'Type' with 'Sold.' (2) According to the outline, the partial correlation —
the additional explanatory power of the third independent variable after the effects of the first two are
considered — comes from r²Y3.12 = (R²Y.123 − R²Y.12)/(1 − R²Y.12) = (.99657 − .97567)/(1 − .97567) = .8590.
It can also be computed easily by using its t-ratio from the computer printout:
r²Y3.12 = t3²/(t3² + df) = 4.27²/(4.27² + 3) = .8587.
d) Since you now have enough information to do it, use an F test to see whether the addition of the two
advertising independent variables as a pair was worthwhile. (4) [10]
The ANOVA given in the printout was as below.

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       3  41.571  13.857  290.24  0.000
Residual Error   3   0.143   0.048
Total            6  41.714

The regression sum of squares for 'Price' alone was 2.1797. So we can do the following.

Source        SS      DF  MS      F       F table
Regression    41.571   3
  X1           2.180   1
  X2, X3      39.391   2  19.696  412.90  F.05(2,3) = 9.55 or F.01(2,3) = 30.82
Error          0.143   3  0.0477
Total         41.714   6

Note that, since the computed F is far above the table values, we can reject the null
hypothesis of no additional explanatory power from the added independent variables.
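(The F statistic can be checked in a few lines of Python; the variable names are illustrative:)

ssr_full, ssr_price = 41.571, 2.1797   # regression SS with all three variables vs. Price alone
mse_full, df_added = 0.0477, 2
F = ((ssr_full - ssr_price) / df_added) / mse_full   # about 412.9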
e) Compute the correlation between 'Adv' and 'Price' and test it for significance. Try to use the spare parts
that you already have. (4) [14]
Solution: The relevant parts of the Spare Parts Computation are repeated:
ΣX1² − nX̄1² = SSX1 = 5624.44 − 7(28.34286)² = 1.21601*
ΣX2² − nX̄2² = SSX2 = 1517 − 7(13.85714)² = 172.8577*
ΣX1X2 − nX̄1X̄2 = SX1X2 = 2751.4 − 7(28.34286)(13.85714) = 2.14315
The simple sample correlation coefficient is r = Sxy/√(SSx·SSy). So the correlation we need comes from
r² = (SX1X2)²/(SSX1·SSX2) = (2.14315)²/[1.21601(172.8577)] = 0.02185, so r = 0.14782. If we want to
test H0: ρx1x2 = 0 against H1: ρx1x2 ≠ 0 and x and y are normally distributed, we use
t(n − 2) = r/sr = r/√[(1 − r²)/(n − 2)] = 0.14782/√[(1 − 0.02185)/5] = 0.33421. The rejection zone is above
tα/2 and below −tα/2. If we use t.005(5) = 4.032 or t.025(5) = 2.571, it should be clear that we will not reject
the null hypothesis of insignificance.
f) Test the same correlation to see if it is 0.2. (4) [18, 78] To test H0: ρx1x2 = ρ0 against H1: ρx1x2 ≠ ρ0
when ρ0 ≠ 0, use Fisher's z-transformation. Let z̃ = ½ln[(1 + r)/(1 − r)]. This has an approximate mean of
μz = ½ln[(1 + ρ0)/(1 − ρ0)] and a standard deviation of sz = 1/√(n − 3), so that t = (z̃ − μz)/sz. We have
r = 0.14782, so z̃ = ½ln[(1 + 0.14782)/(1 − 0.14782)] = ½ln(1.34692) = ½(0.29782) = 0.14891,
μz = ½ln[(1 + 0.2)/(1 − 0.2)] = ½ln(1.5) = ½(0.40547) = 0.20273 and sz = 1/√(7 − 3) = 0.5. Finally
t = (0.14891 − 0.20273)/0.5 = −0.1076. The rejection zone is above tα/2 and below −tα/2. If we use
t.005(5) = 4.032 or t.025(5) = 2.571, it should be clear that we will not reject the null hypothesis.
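(Fisher's transformation is numpy's arctanh, so the whole test is a short Python sketch:)

import numpy as np
r, rho0, n = 0.14782, 0.2, 7
stat = (np.arctanh(r) - np.arctanh(rho0)) * np.sqrt(n - 3)   # about -0.108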
g) Don’t forget to hand in your last computer problem. Check here if you did. __________________.
(2 to 7)
[78+]
5) The manager of a computer network has the following data on the 200 service interruptions that have occurred over the last 100
days.

x    O
0     2
1    51
2    18
3    12
4    11
5     4
6     1
7     1
    100

a) Test to see if these follow a Poisson distribution. (6)
b) Use another method to test whether this has a Poisson distribution with a parameter of 1.8. (5)
c) A coin is to be tested to see if it is fair. In order to test it the coin is given 5 flips
100 times and the number of heads in 5 flips is recorded below. This means that there
are a total of 500 flips and the coin has come up heads 255 times. Construct a 99%
confidence interval for the proportion of times it comes up heads. Test the hypothesis
that the proportion is 50% using this interval. (4)
d) The distribution shown below should be a binomial distribution with n = 5 and
p = .5. A more powerful test of the fairness of the coin should be to use
probabilities from your cumulative binomial table to check whether this distribution
is correct. (4) [19, 97]

x    O
0     3
1    16
2    30
3    29
4    18
5     4
    100

e) Assume that a coin is flipped 20 times and comes up heads half the time. If the sequence of heads and tails is
HHHTTTHHHTTTHHHTTTTH, can we say that the sequence is random? (This is not a yes or no question – I want a statistical test
for randomness!) (2)
f) Now assume that there are 5 times as many flips and 5 times as many runs and heads half the time. Can we say that the sequence is
random now? (3) [24, 102]
Solution: a) Since the mean is unknown, we must use a chi-squared test. If there are 200 interruptions in
100 days, we gather that the mean is 2. The printout supplied individual and cumulative Poisson
probabilities for means of 1.8, 2.0 and 2.5; the relevant parts follow.

 k   Poisson(1.8)           Poisson(2.0)           Poisson(2.5)
     P(x=k)    P(x<=k)      P(x=k)    P(x<=k)      P(x=k)    P(x<=k)
 0   0.165299  0.16530      0.135335  0.13534      0.082085  0.08208
 1   0.297538  0.46284      0.270671  0.40601      0.205213  0.28730
 2   0.267784  0.73062      0.270671  0.67668      0.256516  0.54381
 3   0.160671  0.89129      0.180447  0.85712      0.213763  0.75758
 4   0.072302  0.96359      0.090224  0.94735      0.133602  0.89118
 5   0.026029  0.98962      0.036089  0.98344      0.066801  0.95798
 6   0.007809  0.99743      0.012030  0.99547      0.027834  0.98581
 7   0.002008  0.99944      0.003437  0.99890      0.009941  0.99575
 8   0.000452  0.99989      0.000859  0.99976      0.003106  0.99886
 9   0.000090  0.99998      0.000191  0.99995      0.000863  0.99972
10   0.000016  1.00000      0.000038  0.99999      0.000216  0.99994
11   0.000003  1.00000      0.000007  1.00000      0.000049  0.99999
12   0.000000  1.00000      0.000001  1.00000      0.000010  1.00000
13   0.000000  1.00000      0.000000  1.00000      0.000002  1.00000

If we take the Poisson(2.0) frequencies f and multiply by n = 100 to get our E, we get into big trouble:
none of the values of E for x ≥ 5 is above 5. We will thus create a category for x ≥ 5. (Note
that for accuracy this was done by computer.) The data is put into 6 categories. Since we
estimated a mean from the data, we have 6 − 1 − 1 = 4 degrees of freedom. Because there were no
E cells below 5, the shortcut method was used to compute χ² = Σ(O²/E) − n = 136.589 − 100 = 36.589.

x     E = 100f     O      O²/E
0      13.5335     2    0.2956
1      27.0671    51   96.0947
2      27.0671    18   11.9703
3      18.0447    12    7.9802
4       9.0224    11   13.4111
5+      5.2653     6    6.8372
      100.0001   100  136.589

According to the chi-squared table, χ².05(4) = 9.4877. Since the computed value is larger than the table
value, we can reject H0: x ~ Poisson.
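(The same chi-squared test in a minimal Python/scipy sketch:)

import numpy as np
from scipy import stats
O = np.array([2, 51, 18, 12, 11, 6])          # observed counts, x >= 5 pooled
p = stats.poisson.pmf(np.arange(5), 2.0)      # P(x = 0), ..., P(x = 4) for a mean of 2
p = np.append(p, 1 - p.sum())                 # P(x >= 5) completes the cells
E = 100 * p
chi2 = ((O - E) ** 2 / E).sum()               # about 36.6
crit = stats.chi2.ppf(0.95, 4)                # about 9.49, so reject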
b) If the mean is known we can use a Kolmogorov-Smirnov test. The Poisson(1.8) table is used to get the
expected cumulative frequency FE. To get the observed cumulative frequency FO, add the O column to get
the Cum O column, then divide the Cum O column by n = 100. The D column is the absolute difference
between the FO column and the FE column.

x      FE     O  Cum O    FO       D
0  .16530     2      2   .02  .14530
1  .46284    51     53   .53  .06716
2  .73062    18     71   .71  .02062
3  .89129    12     83   .83  .06129
4  .96359    11     94   .94  .02359
5  .98962     4     98   .98  .00962
6  .99743     1     99   .99  .00743
7  .99944     1    100  1.00  .00056
8  .99989     0    100  1.00  .00011
9  .99998     0    100  1.00  .00002

For the K-S test with a sample size above 40, the 5% critical value is 1.36/√100 = .136. The largest
difference, .14530 at x = 0, exceeds this, so we reject the null hypothesis H0: x ~ Poisson(1.8).
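(The K-S computation in a minimal Python/scipy sketch:)

import numpy as np
from scipy import stats
O = np.array([2, 51, 18, 12, 11, 4, 1, 1])
F_O = O.cumsum() / O.sum()
F_E = stats.poisson.cdf(np.arange(8), 1.8)
D = np.abs(F_O - F_E).max()          # about 0.145, at x = 0
crit = 1.36 / np.sqrt(O.sum())       # about 0.136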
c) A coin is to be tested to see if it is fair. In order to test it the coin is given 5 flips 100 times and the
number of heads in 5 flips is recorded. This means that there are a total of 500 flips and the coin has
come up heads 255 times. Construct a 99% confidence interval for the proportion of times it comes up
heads. Test the hypothesis that the proportion is 50% using this interval. (4) The formula table excerpt is
below.

Interval for: Proportion
  Confidence interval: p = p̂ ± zα/2·sp̂, where sp̂ = √(p̂q̂/n) and q̂ = 1 − p̂
  Hypotheses: H0: p = p0, H1: p ≠ p0
  Test ratio: z = (p̂ − p0)/σp̂, where σp̂ = √(p0q0/n) and q0 = 1 − p0
  Critical value: p̂cv = p0 ± zα/2·σp̂

Our hypotheses are H0: p = .5 and H1: p ≠ .5. p̂ = x/n = 255/500 = .5100, z.005 = 2.576 and
sp̂ = √(p̂q̂/n) = √[.510(.490)/500] = √.00050 = .02236. So p = p̂ ± zα/2·sp̂ = .5100 ± 2.576(.02236)
= .5100 ± .0576, or .4524 to .5676. This interval (which looks terribly large) includes .5, so we cannot
reject the null hypothesis.
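(The same interval in a minimal Python/scipy sketch:)

import numpy as np
from scipy import stats
x, n = 255, 500
p_hat = x / n
s_p = np.sqrt(p_hat * (1 - p_hat) / n)    # about .02236
z = stats.norm.ppf(1 - 0.01 / 2)          # about 2.576
ci = (p_hat - z * s_p, p_hat + z * s_p)   # about (.452, .568); includes .5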
d) The distribution shown should be a binomial distribution with n = 5 and p = .5. A more powerful
test of the fairness of the coin should be to use probabilities from your cumulative binomial table to check
whether this distribution is correct. (4) [19, 97] H0: x ~ Binomial(p = .5, n = 5). FE is copied from the
binomial table.

x       FE    O  Cum O    FO       D
0   .03125    3      3   .03  .00125
1   .18750   16     19   .19  .00250
2   .50000   30     49   .49  .01000
3   .81250   29     78   .78  .03250
4   .96875   18     96   .96  .00875
5  1.00000    4    100  1.00  0
            100

According to the K-S table, the 5% critical value for the largest number in D is 1.36/√n = 1.36/√100 = 0.136.
But no number in D is that large, so we cannot reject the null hypothesis.
e) Assume that a coin is flipped 20 times and comes up heads half the time. If the sequence of heads and
tails is HHHTTTHHHTTTHHHTTTTH, can we say that the sequence is random? (This is not a yes or no
question – I want a statistical test for randomness!) (2)
This is a runs test with n1 = 10 and n2 = 10. We count the runs and find r = 7:
HHH TTT HHH TTT HHH TTTT H
 1   2   3   4   5   6   7
The 5% runs test table has a null hypothesis H0: Randomness. The critical values given for n1 = 10 and
n2 = 10 are 6 and 16. Since the number of runs is between these two numbers, we cannot reject the null
hypothesis.
f) Now assume that there are 5 times as many flips and 5 times as many runs and heads half the time. Can we say that the sequence is random now? (3) [24, 102] If $n_1$ and $n_2$ are too large for the table, $r$ follows the Normal distribution with mean $\mu = \frac{2n_1 n_2}{n} + 1 = \frac{2(50)(50)}{100} + 1 = 51$ and variance $\sigma^2 = \frac{2n_1 n_2(2n_1 n_2 - n)}{n^2(n-1)} = \frac{5000(4900)}{10000(99)} = \frac{50(49)}{99} = 24.74747$. If the number of runs is $r = 5(7) = 35$, we have $z = \frac{r - \mu}{\sigma} = \frac{35 - 51}{\sqrt{24.74747}} = \frac{-16}{4.97468} = -3.21629$. A conventional 5% Normal test is to fail to reject the null hypothesis if this value of $z$ is between -1.960 and +1.960. It is not, so we reject the null hypothesis. Note how the larger sample size has made the runs test more powerful.
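A minimal sketch of this large-sample runs test in Python (plain math module only; the counts are assumptions carried over from part e):

import math

n1 = n2 = 50                 # 50 heads and 50 tails
n = n1 + n2
r = 35                       # 5 times the 7 runs observed in part e
mu = 2 * n1 * n2 / n + 1     # 51
var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))   # 24.74747
print((r - mu) / math.sqrt(var))   # -3.21629, outside +/-1.960: reject randomness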
6) Do the following. Use a 1% significance level in this problem!
a) (Multiple choice) I wish to test to see if a distribution is Normal, but I must first use my data to figure out the mean and standard
deviation. I have 100 data points divided into 0 to under 20, 20 to under 40, 40 to under 60, 60 to under 80 and 80 to under 100.
Assume that my expected frequency is 5 or larger for each class. I could use
(i) A chi-squared test with 4 degrees of freedom or a Kolmogorov – Smirnov test.
(ii) A chi-squared test with 2 degrees of freedom or a Kolmogorov – Smirnov test.
(iii) A chi-squared test with 4 degrees of freedom or a Lilliefors test.
(iv) A chi-squared test with 2 degrees of freedom or a Lilliefors test.
(v) Only a Lilliefors test.
(vi) Only a Kolmogorov – Smirnov test.
(vii) Only a chi-squared test.
(2)
b) (Bassett et al) An industrial process is run at 4 different temperatures on four different days. A random sample of 3 units is taken
and scored. The results are as follows. Do the scores differ according to temperature?
100C degrees   120C degrees   140C degrees   160C degrees
     41             54             50             38
     44             56             52             36
     48             53             48             41
Minitab has computed the following.
Sum of 100C = 133, Sum of 120C = 163, Sum of 140C = 150, Sum of 160C = 115,
Sum of squares of 100C = 5921, Sum of squares of 120C = 8861, Sum of squares of 140C =
7508, Sum of squares of 160C = 4421, Bartlett's Test - Test statistic = 1.22, p-value =
0.748 and Levene's Test - Test statistic = 0.43, p-value = 0.736.
Assume that the scores are not considered to come from the Normal distribution, state your null hypothesis and test it. (5).
c) Assume that the scores are considered to come from the Normal distribution, state your null hypothesis and test it. (6)
d) Why were the Bartlett and Levene tests run? Which of the two is correct here if the underlying distribution is Normal? What do
they tell us? (2) [15]
e) Ignore everything that has gone before. Assume that the Normal distribution applies and test the hypothesis that the mean of the
120C population is larger than the mean of the 100C population. Assume that the underlying distributions are Normal and have equal
variances (4) or assume that the underlying distributions are Normal and do not necessarily have equal variances. (6) Do not do both!
[19, 116]
Solution: a) (Multiple choice) I wish to test to see if a distribution is Normal, but I must first use my data to
figure out the mean and standard deviation. I have 100 data points divided into 0 to under 20, 20 to under
40, 40 to under 60, 60 to under 80 and 80 to under 100. Assume that my expected frequency is 5 or larger
for each class. I could use
(i) A chi-squared test with 4 degrees of freedom or a Kolmogorov – Smirnov test.
(ii) A chi-squared test with 2 degrees of freedom or a Kolmogorov – Smirnov test.
(iii) A chi-squared test with 4 degrees of freedom or a Lilliefors test.
(iv) *A chi-squared test with 2 degrees of freedom or a Lilliefors test.
(v) Only a Lilliefors test.
(vi) Only a Kolmogorov – Smirnov test.
(vii) Only a chi-squared test.
(2)
Since the parameters are not known, we cannot use a Kolmogorov – Smirnov test. We have 5 classes and have estimated two parameters, so that $df = 5 - 1 - 2 = 2$ for the chi-squared test.
b) (Bassett et al) An industrial process is run at 4 different temperatures on four different days. A random
sample of 3 units is taken and scored. The results are as follows. Do the scores differ according to
temperature?
100C degrees   120C degrees   140C degrees   160C degrees
     41             54             50             38
     44             56             52             36
     48             53             48             41
Assume that the scores are not considered to come from the Normal distribution, state your null hypothesis and test it. (5) If these scores are a random sample, we must use the Kruskal-Wallis test with $H_0: \nu_1 = \nu_2 = \nu_3 = \nu_4$ (equal medians). Replace the numbers by their ranks.
x1    r1      x2    r2      x3    r3      x4    r4
41    3.5     54    11      50    8       38    2
44    5       56    12      52    9       36    1
48    6.5     53    10      48    6.5     41    3.5
     15.0          33            23.5          6.5
We must check to see that $15.0 + 33 + 23.5 + 6.5 = 78 = \frac{12(13)}{2}$. Now compute the Kruskal-Wallis statistic
$H = \frac{12}{n(n+1)} \sum_i \frac{SR_i^2}{n_i} - 3(n+1) = \frac{12}{12(13)}\left[\frac{15^2}{3} + \frac{33^2}{3} + \frac{23.5^2}{3} + \frac{6.5^2}{3}\right] - 3(13) = \frac{1}{13}\left(\frac{1908.5}{3}\right) - 39 = 9.9359$.
This is too large for the Kruskal-Wallis table and should be compared with $\chi^2_{.01}(3) = 11.3449$. Since our computed statistic is smaller than the table value, do not reject the null hypothesis of equal medians.
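For readers with Python handy, scipy reproduces this test (a sketch, not part of the original key; scipy applies a correction for the tied ranks, so its $H$ comes out a little above the hand value):

from scipy.stats import kruskal

x1 = [41, 44, 48]; x2 = [54, 56, 53]; x3 = [50, 52, 48]; x4 = [38, 36, 41]
H, p = kruskal(x1, x2, x3, x4)
print(H, p)   # H about 10.0, p about .019 -- above the 1% level, so do not reject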
c) Assume that the scores are considered to come from the Normal distribution, state your null hypothesis
and test it. (6)
100C degrees   120C degrees   140C degrees   160C degrees
     41             54             50             38
     44             56             52             36
     48             53             48             41
Minitab has computed the following.
Sum of 100C = 133, Sum of 120C = 163, Sum of 140C = 150, Sum of 160C = 115,
Sum of squares of 100C = 5921, Sum of squares of 120C = 8861, Sum of squares of
140C = 7508, Sum of squares of 160C = 4421, Bartlett's Test - Test statistic =
1.22, p-value = 0.748 and Levene's Test - Test statistic = 0.43, p-value =
0.736.
Temp              1 (100C)    2 (120C)    3 (140C)    4 (160C)    Total
                     41          54          50          38
x_ij                 44          56          52          36
                     48          53          48          41
Sum                 133   +     163   +     150   +     115   =     561 = $\sum x_{ij}$
n_j                   3   +       3   +       3   +       3   =      12 = n
$\bar x_j$       44.333      54.333      50.000      38.333
SS                 5921   +    8861   +    7508   +    4421   =   26711 = $\sum x_{ij}^2$
$\bar x_j^2$    1965.44  +  2952.07  +  2500.00  +  1469.44  =  8886.95 = $\sum \bar x_j^2$

$\bar{\bar x} = \frac{\sum x_{ij}}{n} = \frac{561}{12} = 46.750$
$SST = \sum x_{ij}^2 - n \bar{\bar x}^2 = 26711 - 12(46.750)^2 = 26711 - 26226.75 = 484.25$
$SSB = \sum n_j \bar x_j^2 - n \bar{\bar x}^2 = 3(8886.95) - 12(46.750)^2 = 26660.85 - 26226.75 = 434.10$
$SSW = SST - SSB = 484.25 - 434.10 = 50.15$
Source     SS        DF    MS         F         F.01             H0
Between    434.10     3    144.70     23.08s    F(3,8) = 7.59    Column means equal
Within      50.15     8      6.2688
Total      484.25    11

Since the computed F, 23.08, is far above the 1% table value $F_{.01}(3,8) = 7.59$, we reject the null hypothesis that the column (temperature) means are equal.
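scipy will confirm this (a sketch, assuming scipy is installed; it works from the raw data rather than the rounded cell means, so its F is a shade higher than the hand value):

from scipy.stats import f_oneway

x1 = [41, 44, 48]; x2 = [54, 56, 53]; x3 = [50, 52, 48]; x4 = [38, 36, 41]
F, p = f_oneway(x1, x2, x3, x4)
print(F, p)   # F about 23.2, p about .0003 -- well below .01, so reject equal means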
d) Why were the Bartlett and Levene tests run? Which of the two is correct here if the underlying distribution is Normal? What do they tell us? (2) [15] Since we do an ANOVA on the assumption that the underlying data are Normal, we should use the Bartlett test results. The null hypothesis is equal variances, which is another requirement of ANOVA. The extremely high p-value means that we have no reason to doubt the null hypothesis of equal variances.
e) Ignore everything that has gone before. Assume that the Normal distribution applies and test the
hypothesis that the mean of the 120C population is larger than the mean of the 100C population. Assume
that the underlying distributions are Normal and have equal variances (4) or assume that the underlying
distributions are Normal and do not necessarily have equal variances. (6) Do not do both! [19, 116] The
formulas for the methods requested are below.
Interval for: Difference between Two Means ($\sigma$ unknown, variances assumed equal), $D = \mu_1 - \mu_2$
Confidence interval: $D = \bar d \pm t_{\alpha/2} s_{\bar d}$
Hypotheses: $H_0: D = D_0$*; $H_1: D \neq D_0$
Test ratio: $t = \dfrac{\bar d - D_0}{s_{\bar d}}$
Critical value: $d_{cv} = D_0 \pm t_{\alpha/2} s_{\bar d}$
where $s_{\bar d} = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$, $\hat s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ and $DF = n_1 + n_2 - 2$.

Interval for: Difference between Two Means ($\sigma$ unknown, variances assumed unequal)
Confidence interval: $D = \bar d \pm t_{\alpha/2} s_{\bar d}$
Hypotheses: $H_0: D = D_0$*; $H_1: D \neq D_0$
Test ratio: $t = \dfrac{\bar d - D_0}{s_{\bar d}}$
Critical value: $d_{cv} = D_0 \pm t_{\alpha/2} s_{\bar d}$
where $s_{\bar d} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$ and $DF = \dfrac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$.
Minitab has computed the following. Sum of 100C = 133, Sum of 120C = 163, Sum of squares of 100C = 5921, Sum of squares of 120C = 8861. To summarize, $n_1 = n_2 = 3$, $\sum x_1 = 133$, $\sum x_1^2 = 5921$, $\sum x_2 = 163$ and $\sum x_2^2 = 8861$. We have already computed $\bar x_1 = 44.333$ and $\bar x_2 = 54.333$, so that $\bar d = \bar x_1 - \bar x_2 = -10$.
We now need $s_1^2 = \frac{\sum x_1^2 - n_1 \bar x_1^2}{n_1 - 1} = \frac{5921 - 3(44.333)^2}{2} = 12.3777$ ($s_1 = 3.5182$) and $s_2^2 = \frac{\sum x_2^2 - n_2 \bar x_2^2}{n_2 - 1} = \frac{8861 - 3(54.333)^2}{2} = 2.3877$ ($s_2 = 1.5452$). If we assume equal variances, we get $\hat s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{12.3777 + 2.3877}{2} = 7.3827$. This means $s_{\bar d} = \sqrt{\hat s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{7.3827\left(\frac{1}{3} + \frac{1}{3}\right)} = \sqrt{4.9218} = 2.2185$. Our degrees of freedom are $n_1 + n_2 - 2 = 3 + 3 - 2 = 4$ and we will use $-t_{.01}^{(4)} = -3.747$ for a left-sided test. We are testing $H_0: \mu_1 \geq \mu_2$ against $H_1: \mu_1 < \mu_2$ or, if $D = \mu_1 - \mu_2$, $H_0: D \geq 0$ against $H_1: D < 0$. If we use a t-ratio we get $t = \frac{\bar d - D_0}{s_{\bar d}} = \frac{-10 - 0}{2.2185} = -4.5076$. Since this is below $-t_{.01}^{(4)} = -3.747$, we reject the null hypothesis. If we use a critical value for $\bar d$, we need a single value below zero. $d_{cv} = D_0 \pm t_{\alpha/2} s_{\bar d}$ becomes $d_{cv} = D_0 - t_{\alpha} s_{\bar d} = 0 - 3.747(2.2185) = -8.313$. Since $\bar d = \bar x_1 - \bar x_2 = -10$ is below this number, we reject the null hypothesis.
If we do not assume that the variances are equal, we need to do some arithmetic with our standard errors.
$s_{\bar x_1}^2 = \frac{s_1^2}{n_1} = \frac{12.3777}{3} = 4.1259$ and $s_{\bar x_2}^2 = \frac{s_2^2}{n_2} = \frac{2.3877}{3} = 0.7959$. These sum to $\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} = 4.1259 + 0.7959 = 4.9218$, so that $s_{\bar d} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{4.9218} = 2.2185$ and
$DF = \dfrac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}} = \dfrac{(4.9218)^2}{\frac{(4.1259)^2}{2} + \frac{(0.7959)^2}{2}} = \frac{24.2241}{8.51153 + 0.31673} = \frac{24.2241}{8.82826} = 2.7439$, which is rounded down to 2.
We will use $-t_{.01}^{(2)} = -6.965$ for a left-sided test. We are again testing $H_0: \mu_1 \geq \mu_2$ against $H_1: \mu_1 < \mu_2$ or, if $D = \mu_1 - \mu_2$, $H_0: D \geq 0$ against $H_1: D < 0$. If we use a t-ratio we get $t = \frac{\bar d - D_0}{s_{\bar d}} = \frac{-10 - 0}{2.2185} = -4.5076$. Since this is not below $-t_{.01}^{(2)} = -6.965$, we cannot reject the null hypothesis. If we use a critical value for $\bar d$, we need a single value below zero. $d_{cv} = D_0 \pm t_{\alpha/2} s_{\bar d}$ becomes $d_{cv} = D_0 - t_{\alpha} s_{\bar d} = 0 - 6.965(2.2185) = -15.452$. Since $\bar d = \bar x_1 - \bar x_2 = -10$ is not below this number, we cannot reject the null hypothesis.
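Both versions of the test are one-liners in recent scipy (a sketch, not part of the original key; scipy works from the raw data, so its t is about -4.52 rather than the hand value's -4.51, and it uses the fractional Satterthwaite DF instead of rounding down):

from scipy.stats import ttest_ind

x1 = [41, 44, 48]   # 100C scores
x2 = [54, 56, 53]   # 120C scores
print(ttest_ind(x1, x2, equal_var=True, alternative='less'))    # pooled: p about .005, reject at 1%
print(ttest_ind(x1, x2, equal_var=False, alternative='less'))   # Welch: p above .01, cannot reject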
7) The following are tests of proportions. (Bassett et al). You must do legitimate tests at the 10% significance level.
a) Is there any association between Forecasted and observed rainfall? 173 forecasts are considered.
Observed Rainfall   No Rain Forecasted   Light Rain Forecasted   Heavy Rain Forecasted
None                        34                    24                      17
Light Rain                  21                     4                       3
Heavy Rain                  23                     9                      38
State your null and alternative hypotheses and test it. (7)
b) Are there significant differences in the proportions of female insects in 3 different locations?
In location 1, 44% of 100 bugs are female. In location 2, 43% of 200 bugs are female. In location 3, 55% of 200 bugs are female.
First test to see if there is a significant difference between the proportions in locations 1 and 2. (4)
c) In b, test whether proportions of females are independent of location using all three proportions. (5) [16, 132]
Solution:
a) Is there any association between Forecasted and observed rainfall? 173 forecasts are considered.
Observed Rainfall   No Rain Forecasted   Light Rain Forecasted   Heavy Rain Forecasted
None                        34                    24                      17
Light Rain                  21                     4                       3
Heavy Rain                  23                     9                      38
State your null and alternative hypotheses and test it. (7)
This is a chi-squared test of independence or homogeneity. The null hypothesis is homogeneity.
We repeat the data as our O table. The row sums are made into proportions in the ‘rp’ column.
O    Observed    Onone    Olight    Oheavy    Sum       rp
1    None          34        24        17       75    0.433526
2    Light         21         4         3       28    0.161850
3    Heavy         23         9        38       70    0.404624
     Sum           78        37        58      173    1.000000
The E table is gotten by applying row proportions to the column totals.
E    Observed    Enone      Elight     Eheavy     Sum
1    None        33.8150    16.0405    25.1445     75
2    Light       12.6243     5.9884     9.3873     28
3    Heavy       31.5607    14.9711    23.4682     70
     Sum         78         37         58         173

Because none of the expected cells are below 5, the shortcut formula $\chi^2 = \sum \frac{O^2}{E} - n = 203.854 - 173 = 30.854$ is used to calculate the test statistic. The degrees of freedom are $(3-1)(3-1) = 4$. The 5% table value of chi-squared with 4 degrees of freedom is 9.4877, but you were supposed to use the 10% value, 7.7794. Since our computed chi-squared is above the table value, we reject the null hypothesis and conclude that there is some association between the forecast and the observed weather.

Row     O      E          O²/E
1      34    33.8150     34.1860
2      21    12.6243     34.9327
3      23    31.5607     16.7614
4      24    16.0405     35.9092
5       4     5.9884      2.6718
6       9    14.9711      5.4104
7      17    25.1445     11.4936
8       3     9.3873      0.9587
9      38    23.4682     61.5300
      173   173.0000    203.854
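scipy builds the expected table and the statistic in one call (a sketch, assuming scipy is available):

import numpy as np
from scipy.stats import chi2_contingency

# rows = observed rainfall (None, Light, Heavy); columns = forecast (None, Light, Heavy)
O = np.array([[34, 24, 17],
              [21,  4,  3],
              [23,  9, 38]])
chi2, p, dof, E = chi2_contingency(O)
print(chi2, dof, p)   # about 30.85 with 4 DF; p is tiny, so reject independence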
b) Are there significant differences in the proportions of female insects in 3 different locations?
In location 1, 44% of 100 bugs are female. In location 2, 43% of 200 bugs are female. In location 3, 55% of
200 bugs are female. First test to see if there is a significant difference between the proportions in locations
1 and 2. (4) Behold, the usual excerpt from the formula table.
Interval for: Difference between proportions, $\Delta p = p_1 - p_2$
Confidence interval: $\Delta p = \Delta \bar p \pm z_{\alpha/2} s_{\Delta p}$, where $\Delta \bar p = \bar p_1 - \bar p_2$, $s_{\Delta p} = \sqrt{\frac{\bar p_1 \bar q_1}{n_1} + \frac{\bar p_2 \bar q_2}{n_2}}$ and $\bar q = 1 - \bar p$
Hypotheses: $H_0: \Delta p = \Delta p_0$; $H_1: \Delta p \neq \Delta p_0$, where $\Delta p_0 = p_{01} - p_{02}$
Test ratio: $z = \dfrac{\Delta \bar p - \Delta p_0}{\sigma_{\Delta p}}$. If $\Delta p_0 = 0$, $\sigma_{\Delta p} = \sqrt{\bar p_0 \bar q_0 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$ with $\bar p_0 = \frac{n_1 \bar p_1 + n_2 \bar p_2}{n_1 + n_2}$; if $\Delta p_0 \neq 0$, $\sigma_{\Delta p} = \sqrt{\frac{p_{01} q_{01}}{n_1} + \frac{p_{02} q_{02}}{n_2}}$, or use $s_{\Delta p}$
Critical value: $\Delta p_{cv} = \Delta p_0 \pm z_{\alpha/2} \sigma_{\Delta p}$
Our null hypothesis is $p_1 = p_2$. Again, I'll skip the confidence interval. Our facts are $n_1 = 100$, $\bar p_1 = .44$, $n_2 = 200$, $\bar p_2 = .43$, so $\Delta \bar p = \bar p_1 - \bar p_2 = .01$, with $\alpha = .10$. I'm in a hurry, so we will say $\bar p_0 = \frac{44 + 86}{100 + 200} = .43333$. The standard error is $\sigma_{\Delta p} = \sqrt{\bar p_0 \bar q_0 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{.43333(.56667)\left(\frac{1}{100} + \frac{1}{200}\right)} = \sqrt{.0036833} = .06069$. This is a 2-sided test, so I will use $z_{.05} = 1.645$. The test ratio is $z = \frac{\Delta \bar p - \Delta p_0}{\sigma_{\Delta p}} = \frac{.01 - 0}{.06069} = .1648$. We do not reject if our value of $z$ is between $\pm z_{.05} = \pm 1.645$; it is, so we cannot reject the null hypothesis. If we use a critical value, we need $\Delta p_{cv} = \Delta p_0 \pm z_{\alpha/2} \sigma_{\Delta p} = 0 \pm 1.645(.06069) = \pm .0998$. As long as $\Delta \bar p = \bar p_1 - \bar p_2$ is between these values we cannot reject the null hypothesis, and $.01$ is well inside the 'do not reject' region: the proportions in locations 1 and 2 do not differ significantly.
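The arithmetic is easy to verify in Python (a minimal sketch with the pooled standard error, not part of the original key):

import math
from scipy.stats import norm

n1, n2 = 100, 200
p1, p2 = 0.44, 0.43
p0 = (n1 * p1 + n2 * p2) / (n1 + n2)               # pooled proportion .43333
se = math.sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))  # .06069
z = (p1 - p2) / se
print(z, 2 * norm.sf(abs(z)))   # z about 0.16, two-sided p about .87: do not reject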
c) In b, test whether proportions of females are independent of location using all three proportions. (5) [16, 132] Our facts are $n_1 = 100$, $\bar p_1 = .44$, $n_2 = 200$, $\bar p_2 = .43$, $n_3 = 200$ and $\bar p_3 = .55$. If anyone tried to use $\Delta \bar p = \bar p_1 - \bar p_2 - \bar p_3$, I can only say that I warned you – what does this expression equal when all the p's are equal to, say, .3? Tests involving more than 2 proportions are chi-squared tests. Our observed table is gotten by multiplying the sample sizes by the observed proportions.
O    Row       Loc 1    Loc 2    Loc 3    Sum      rp
1    Female      44       86      110      240    0.480
2    Male        56      114       90      260    0.520
     Sum        100      200      200      500    1.000

E    Row       Loc 1    Loc 2    Loc 3    Sum      rp
1    Female      48       96       96      240    0.480
2    Male        52      104      104      260    0.520
     Sum        100      200      200      500    1.000
Because none of the expected cells are below 5, the shortcut formula $\chi^2 = \sum \frac{O^2}{E} - n = 506.571 - 500 = 6.571$ is used to calculate the test statistic. The degrees of freedom are $(2-1)(3-1) = 2$. The 10% table value of chi-squared with 2 degrees of freedom is 4.6052. Since our computed chi-squared is above the table value, we reject the null hypothesis and must conclude that there is a significant difference between the proportions of females in each location.

Row     O      E      O²/E
1      44     48     40.333
2      56     52     60.308
3      86     96     77.042
4     114    104    124.962
5     110     96    126.042
6      90    104     77.885
      500    500    506.571
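The same statistic can be checked longhand in Python (a sketch using the standard $(O-E)^2/E$ form, which must agree with the shortcut):

import numpy as np
from scipy.stats import chi2

O = np.array([[ 44,  86, 110],
              [ 56, 114,  90]])                        # females / males by location
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()   # expected counts 48, 96, 96, ...
print(((O - E) ** 2 / E).sum(), chi2.ppf(0.90, 2))     # about 6.571 versus 4.6052: reject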
8) The following are odds and ends that don't fit anywhere else. We are selling our production in an imperfect market. x1 is the number of units produced and x2 is our revenue. r1 and r2 are the ranks of the items in x1 and x2. $\alpha = .05$.
Row     x1     x2     r1     r2
 1     330    221      7      7
 2     263    194      5      5
 3     428    245      9     10
 4     584    243     10      8
 5     423    244      8      9
 6     219    171      4      4
 7     308    213      6      6
 8     123    108      1      1
 9     173    143      3      3
10     140    120      2      2

Minitab has computed the following: Sum of x1 = 2991, Sum of x2 = 1902, Sum of squares of x1 = 1088721, Sum of squares of x2 = 386210 and Sum of x1x2 = 631812.
a) Test x1 to see if its median is 200. Do not use the sign test or compute any medians. (4)
b) Assuming that x1 and x2 are both random samples from a nonnormal distribution, test to see if they have similar medians. (4)
c) Compute the correlation between x1 and x2 and the rank correlation between them. Why is the rank correlation higher? (6)
d) Test the rank correlation for significance. (2) [16]
Solution: a) Test x1 to see if its median is 200. Do not use the sign test or compute any medians. (4)
Do a Wilcoxon signed rank test with the alleged median in place of x2. The null hypothesis is $H_0: \nu = 200$.

Row     x1     x2    d = x1 - x2    |d|    rank    r*
 1     330    200        130        130      7      7
 2     263    200         63         63      4      4
 3     428    200        228        228      9      9
 4     584    200        384        384     10     10
 5     423    200        223        223      8      8
 6     219    200         19         19      1      1
 7     308    200        108        108      6      6
 8     123    200        -77         77      5     -5
 9     173    200        -27         27      2     -2
10     140    200        -60         60      3     -3

Compute the two rank totals, T+ = 45 and T- = 10, and check that they sum to $\frac{10(11)}{2} = 55$. Since this is a 2-sided test, look up a 2.5% critical value for $n = 10$; the critical value is 8. If the smaller of the two T's were less than or equal to 8, we would reject the null hypothesis. Since it is not, we do not reject the null hypothesis.
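scipy runs the same test directly on the differences (a sketch, not part of the original key; for $n = 10$ with no zero differences it uses the exact distribution):

import numpy as np
from scipy.stats import wilcoxon

x1 = np.array([330, 263, 428, 584, 423, 219, 308, 123, 173, 140])
stat, p = wilcoxon(x1 - 200)   # signed-rank test of H0: median = 200
print(stat, p)                 # statistic 10 = smaller rank sum; p about .08 > .05: do not reject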
b) Assuming that x1 and x2 are both random samples from a nonnormal distribution, test to see if they have similar medians. (4)
This is a good old Wilcoxon-Mann-Whitney test. Rank the data within the whole sample (from 1 to 20 here).

Row     x1     x2     r1     r2
 1     330    221     17     11
 2     263    194     15      8
 3     428    245     19     14
 4     584    243     20     12
 5     423    244     18     13
 6     219    171     10      6
 7     308    213     16      9
 8     123    108      3      1
 9     173    143      7      5
10     140    120      4      2
                     129.0   81.0

The two rank sums are 129 and 81. (We should check to see that they add to the sum of the first 20 ranks, $\frac{20(21)}{2} = 210$.) We do not have a table that covers samples as large as those here. $W$, the smaller of the two rank sums, has the Normal distribution with mean $\mu_W = \frac{1}{2} n_1 (n_1 + n_2 + 1) = 0.5(10)(21) = 105$ and variance $\sigma_W^2 = \frac{1}{6} n_2 \mu_W = \frac{1}{6}(10)(105) = 175$. This means $z = \frac{81 - 105}{\sqrt{175}} = -1.814$. Since this is between $\pm 1.96$ we cannot reject the null hypothesis of equal medians.
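scipy's version reports the equivalent U statistic rather than the rank sum (a sketch; U for the second sample is $81 - \frac{10(11)}{2} = 26$, and recent scipy returns U for the first sample, $100 - 26 = 74$):

from scipy.stats import mannwhitneyu

x1 = [330, 263, 428, 584, 423, 219, 308, 123, 173, 140]
x2 = [221, 194, 245, 243, 244, 171, 213, 108, 143, 120]
U, p = mannwhitneyu(x1, x2, alternative='two-sided')
print(U, p)   # U = 74 and p about .07, so do not reject at the 5% level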
c) Compute the correlation between x1 and x2 and the rank correlation between them. Why is the rank correlation higher? (6)
I have already done one correlation. With the sums given here you should be able to find that the correlation is .913. The formula for rank correlation is $r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$. Using what was given, we get the table below, where $d = r_1 - r_2$.

Row     x1     x2     r1     r2     d     d²
 1     330    221      7      7     0      0
 2     263    194      5      5     0      0
 3     428    245      9     10    -1      1
 4     584    243     10      8     2      4
 5     423    244      8      9    -1      1
 6     219    171      4      4     0      0
 7     308    213      6      6     0      0
 8     123    108      1      1     0      0
 9     173    143      3      3     0      0
10     140    120      2      2     0      0
                                           6

We have $r_s = 1 - \frac{6(6)}{10(10^2 - 1)} = 1 - \frac{36}{990} = .96364$. This is higher than the ordinary correlation because the relationship between the two variables has a slight curvature, which ranking straightens out.
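Both numbers fall out of scipy directly (a sketch, not part of the original key):

import numpy as np
from scipy.stats import pearsonr, spearmanr

x1 = np.array([330, 263, 428, 584, 423, 219, 308, 123, 173, 140])
x2 = np.array([221, 194, 245, 243, 244, 171, 213, 108, 143, 120])
print(pearsonr(x1, x2)[0], spearmanr(x1, x2)[0])   # about 0.913 and 0.96364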
d) Test the rank correlation for significance. (2) [16] The table says that for $n = 10$, the 2.5% critical value is .6364 and the 5% critical value is .5515. This means that whether our hypotheses are $H_0: \rho_s = 0$ against $H_1: \rho_s \neq 0$ or, more likely, $H_0: \rho_s \leq 0$ against $H_1: \rho_s > 0$, we reject the null hypothesis and say that the rank correlation is significant.