5/01/03 252y0341
Introduction
This is long, but that’s because I gave a relatively thorough explanation of everything that I did.
When I worked it out in the classroom, it was much, much shorter.
The easiest sections were probably 1a, 2a, 2b, 2c, 7b, 7c, and 7d in Part II. There are some very
easy sections in Part I if you just follow suggestions and look at p-values and R-squared.
After doing the above, I might have computed the spare parts in 3b in Part II.
Spare Parts Computation:
$\bar{x} = \frac{\sum x}{n} = \frac{2867}{11} = 260.636$   $\bar{y} = \frac{\sum y}{n} = \frac{13083}{11} = 1189.36$
$SS_x = \sum x^2 - n\bar{x}^2 = 990401 - 11(260.636)^2 = 243158.6$
$S_{xy} = \sum xy - n\bar{x}\bar{y} = 4369154 - 11(260.636)(1189.36) = 959263.6$
$SS_y = \sum y^2 - n\bar{y}^2 = 20022545 - 11(1189.36)^2 = 4462195.7$
And not recomputed them every time I needed them in Problems 4 and 5! Only then would I have tried the multiple regression.
Many of you seem to have no idea what a statistical test is. We have been doing them every day. The most common examples of this were in Part II.
Problem 1b: $p_3 = \frac{7}{115} = .06087$ and $p_2 = \frac{10}{103} = .09709$. Hey look! They're different! Whoopee! And you think that you will get credit for this? Whether these two proportions come from the same population or not, chances are they will be somewhat different; you need one of the three statistical tests shown in the solution to show that they are significantly different.
Problem 3c: Many of you started with
$s_x^2 = \frac{\sum x^2 - n\bar{x}^2}{n-1} = \frac{990401 - 11(260.636)^2}{10} = \frac{243158.6}{10} = 24315.86$
$s = \sqrt{24315.86} = 155.90$
This is fine, and you got some credit for knowing how to compute a sample variance, though you probably had already computed the numerator somewhere else in this exam. But then you told me that this wasn't 200. Where was your test?
ECO252 QBA2
FINAL EXAM
May 6, 2003
Name
KEY
Hour of Class Registered (Circle)
I. (18 points) Do all the following. Note that answers without reasons receive no credit.
A researcher wishes to use demographic information to predict sales of a large chain of nationwide sports stores. The researcher assembles the following data for a random sample of 38 stores. Use $\alpha = .10$ in this problem. That's why p-values above .10 mean that the null hypothesis of insignificance is not rejected.
Row   Sales     Age       Growth    Income    HS        College
1     1695713   33.1574    0.8299   26748.5   73.5949   17.8350
2     3403862   32.6667    0.6619   53063.8   88.4557   31.9439
3     2710353   35.6553    0.9688   36090.1   73.5362   18.6198
4      529215   33.0728    0.0821   32058.1   79.1780   20.6284
5      663687   35.7585    0.4646   47843.4   84.1838   35.2032
6     2546324   33.8132    2.1796   50181.0   93.4996   41.7057
7     2787046   30.9797    1.8048   30710.1   78.0234   28.0250
8      612696   30.7843   -0.0569   29141.7   70.2949   15.0882
9      891822   32.3164   -0.1577   25980.2   70.6674   10.9829
10    1124968   32.5312    0.3664   18730.9   63.7395   13.2458
11     909501   31.4400    2.2256   31109.2   76.9059   19.5500
12    2631167   33.1613    1.5158   35614.1   82.9452   20.8135
13     882973   31.8736    0.1413   23038.4   65.2127   16.9796
14    1078573   33.4072   -1.0400   34531.7   73.4944   32.9920
15     844320   34.0470    1.6836   30350.4   80.2201   22.3185
16    1849119   28.8879    2.3596   38964.9   87.5973   24.5670
17    3860007   36.1056    0.7840   49392.8   85.3041   30.8790
18     826574   32.8083    0.1164   25595.7   65.5884   17.4545
19     604683   33.0538    1.1498   29622.6   80.6176   18.6356
20    1903612   33.4996    0.0606   31586.1   80.3790   38.3249
21    2356808   32.6809    1.6338   39674.6   79.8526   23.7780
22    2788572   28.5166    1.1256   28879.0   81.2371   16.9300
23     634878   32.8945    1.4884   24287.1   70.2244   19.1429
24    2371627   30.5024    4.7937   46711.2   87.1046   30.8843
25    2627838   30.2922    1.8922   33449.8   80.2057   26.5570
26    1868116   31.2911    1.8667   31694.5   75.2914   28.3600
27    2236797   33.0498    1.7896   25459.2   77.6162   19.2490
28    1318876   32.9348    0.2707   47047.3   85.1753   35.4994
29    1868098   31.8381    3.0129   26433.2   74.1792   18.6375
30    1695219   31.0794   23.4630   33396.7   81.6991   41.1130
31    2700194   32.1807    0.7041   26179.4   73.4140   17.8566
32    1156050   31.6944   -0.1569   33454.6   73.7161   26.5426
33     643858   34.0263    0.7084   42271.5   78.6493   29.8734
34    2188687   34.7315    0.1353   46514.8   80.9503   24.5374
35     830352   30.5613    0.3848   27030.8   66.8057   14.1390
36    1226906   33.5183    0.7417   42910.1   77.8905   20.8340
37     566904   32.3952    0.6693   40561.4   79.3622   19.0309
38     826518   29.9108    0.1111   22326.0   58.3610   10.6729
In the data above ‘Sales’ is the total sales in the last month, ‘Age’ is the median customer age,
‘Growth’ is the population growth rate in the last ten years, ‘Income’ is median family income,
‘HS’ is percent (Don’t forget that it is in per cent) of potential customers with a high school diploma,
‘College’ is percent of potential customers with a college degree.
To start with, the researcher runs 'sales' against each independent variable individually, with the following results. Comments that answer the questions below appear in red.
MTB > regress c1 1 c2
Regression Analysis: Sales versus Age
The regression equation is
Sales = 931626 + 21783 Age

Predictor   Coef     SE Coef   T      P
Constant    931626   2851421   0.33   0.746
Age          21783     87750   0.25   0.805  The p-value above .10 shows us that the slope is insignificant.

S = 919493   R-Sq = 0.2%   R-Sq(adj) = 0.0%  Extremely low R-sq - little explanation of Y.

Analysis of Variance
Source           DF   SS            MS            F      P
Regression        1   52099324721   52099324721   0.06   0.805  Same p-value, same conclusion.
Residual Error   36   3.04368E+13   8.45467E+11
Total            37   3.04889E+13

Unusual Observations
Obs   Age    Sales     Fit       SE Fit   Residual   St Resid
17    36.1   3860007   1718106   353724   2141901     2.52R
22    28.5   2788572   1552797   376045   1235775     1.47 X

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
MTB > regress c1 1 c3
Regression Analysis: Sales versus Growth
The regression equation is
Sales = 1595571 + 26834 Growth

Predictor   Coef      SE Coef   T      P
Constant    1595571   161301    9.89   0.000
Growth        26834    39601    0.68   0.502  The p-value above .10 shows us that the slope is insignificant.

S = 914467   R-Sq = 1.3%   R-Sq(adj) = 0.0%  Extremely low R-sq - little explanation of Y.

Analysis of Variance
Source           DF   SS            MS            F      P
Regression        1   3.83946E+11   3.83946E+11   0.46   0.502  Same p-value, same conclusion.
Residual Error   36   3.01050E+13   8.36249E+11
Total            37   3.04889E+13

Unusual Observations
Obs   Growth   Sales     Fit       SE Fit   Residual   St Resid
17     0.8     3860007   1616609   151819   2243398     2.49R
30    23.5     1695219   2225167   878449   -529948    -2.09RX

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
MTB > regress c1 1 c4
Regression Analysis: Sales versus Income
The regression equation is
Sales = 299877 + 39.2 Income

Predictor   Coef     SE Coef   T      P
Constant    299877   554447    0.54   0.592
Income       39.17    15.71    2.49   0.017  The p-value below .10 shows us that the slope is significant.

S = 849860   R-Sq = 14.7%   R-Sq(adj) = 12.3%  Small R-sq compared with HS regression.

Analysis of Variance
Source           DF   SS            MS            F      P
Regression        1   4.48747E+12   4.48747E+12   6.21   0.017
Residual Error   36   2.60014E+13   7.22262E+11
Total            37   3.04889E+13

Unusual Observations
Obs   Income   Sales     Fit       SE Fit   Residual   St Resid
17    49393    3860007   2234579   276038   1625428    2.02R

R denotes an observation with a large standardized residual
MTB > regress c1 1 c5
Regression Analysis: Sales versus HS
The regression equation is
Sales = - 2969741 + 59660 HS

Predictor   Coef       SE Coef   T       P
Constant    -2969741   1370956   -2.17   0.037
HS             59660     17669    3.38   0.002  The p-value below .10 shows us that the slope is significant.

S = 802004   R-Sq = 24.1%   R-Sq(adj) = 21.9%  Best R-sq of the lot.

Analysis of Variance
Source           DF   SS            MS            F       P
Regression        1   7.33335E+12   7.33335E+12   11.40   0.002
Residual Error   36   2.31556E+13   6.43210E+11
Total            37   3.04889E+13

Unusual Observations
Obs   HS     Sales     Fit       SE Fit   Residual   St Resid
17    85.3   3860007   2119509   192928   1740498    2.24R
38    58.4    826518    512081   358068    314437    0.44 X

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
MTB > regress c1 1 c6
Regression Analysis: Sales versus College
The regression equation is
Sales = 789847 + 35854 College

Predictor   Coef     SE Coef   T      P
Constant    789847   439508    1.80   0.081
College      35854    17582    2.04   0.049  Significant if alpha is .10.

S = 871330   R-Sq = 10.4%   R-Sq(adj) = 7.9%  Small R-sq compared with HS regression.

Analysis of Variance
Source           DF   SS            MS            F      P
Regression        1   3.15714E+12   3.15714E+12   4.16   0.049
Residual Error   36   2.73318E+13   7.59216E+11
Total            37   3.04889E+13

Unusual Observations
Obs   College   Sales     Fit       SE Fit   Residual   St Resid
6     41.7      2546324   2285170   347197    261154    0.33 X
17    30.9      3860007   1896988   189865   1963020    2.31R

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
1. On the basis of the material above the researcher decides that 'HS' is the best single predictor of sales. Please explain why. Consider the values of $R^2$ and the significance tests on the slope of the equation. According to the equation showing the response of sales to HS, how much will sales rise if there is a 1% increase in people with a high school diploma? The average store has sales of $1638487. Relative to this, what percent increase in sales would be caused by a 1 (per cent) increase in 'HS'? (4)
Solution: We could rule out some of these because of pathetically low R-sq or insignificant coefficients. The HS regression has the highest R-sq and a significant coefficient for the independent variable. Sales go up by 59660 for each rise of 1 in HS. Relative to average sales, this is about 3.7%.
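Minitab did the screening above, but the same numbers fall out of the 'spare parts' formulas used throughout this exam. Here is a minimal Python sketch (an addition to this key, not part of the original exam) that computes the slope, intercept and R-squared of a simple regression; the array names in the usage comment are assumptions for illustration.

```python
import numpy as np

def simple_regression(x, y):
    """Slope, intercept and R-squared from the 'spare parts'
    SSx = sum(x^2) - n*xbar^2, SSy, and Sxy = sum(xy) - n*xbar*ybar."""
    n = len(x)
    ssx = np.sum(x**2) - n * x.mean()**2
    ssy = np.sum(y**2) - n * y.mean()**2
    sxy = np.sum(x * y) - n * x.mean() * y.mean()
    b1 = sxy / ssx                   # slope
    b0 = y.mean() - b1 * x.mean()    # intercept
    r_sq = sxy**2 / (ssx * ssy)      # coefficient of determination
    return b0, b1, r_sq

# Example (hypothetical arrays loaded from the data table above):
# b0, b1, r_sq = simple_regression(hs, sales)
```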
2. The researcher tries to improve the prediction by adding another variable. Since there were 4 other variables than 'HS,' there are four regressions below. Do any of them represent an improvement on 'HS' alone? Why? Look at the significance tests on the coefficients of the new variables and the adjusted $R^2$. In order to put this in perspective, the average values of the independent variables are shown below.

Age      Growth   Income   College   HS
32.450   1.599    34175    23.67     77.24

Take the best of the four regressions below and give the value of sales that would be predicted for a store with average values of the independent variables, and explain by what percent sales would rise if 'HS' went up by 1. How much does this differ from the prediction using 'HS' alone? (4)
MTB > regress c1 2 c5 c2
Regression Analysis: Sales versus HS, Age
The regression equation is
Sales = - 2126081 + 60953 HS - 29076 Age

Predictor   Coef       SE Coef   T       P
Constant    -2126081   2678378   -0.79   0.433
HS             60953     18226    3.34   0.002  Significant - way below 10 per cent.
Age           -29076     78952   -0.37   0.715  Insignificant.

S = 811809   R-Sq = 24.3%   R-Sq(adj) = 20.0%

Analysis of Variance
Source           DF   SS            MS            F      P
Regression        2   7.42273E+12   3.71137E+12   5.63   0.008
Residual Error   35   2.30662E+13   6.59034E+11
Total            37   3.04889E+13

Source   DF   Seq SS
HS        1   7.33335E+12
Age       1   89382656452

Unusual Observations
Obs   HS     Sales     Fit       SE Fit   Residual   St Resid
17    85.3   3860007   2023658   325389   1836350    2.47R

R denotes an observation with a large standardized residual
MTB > regress c1 2 c5 c3
Regression Analysis: Sales versus HS, Growth
The regression equation is
Sales = - 2959336 + 59494 HS + 1506 Growth

Predictor   Coef       SE Coef   T       P
Constant    -2959336   1412551   -2.10   0.043
HS             59494     18355    3.24   0.003  Significant
Growth          1506     36079    0.04   0.967  Insignificant!

S = 813360   R-Sq = 24.1%   R-Sq(adj) = 19.7%

Analysis of Variance
Source           DF   SS            MS            F      P
Regression        2   7.33450E+12   3.66725E+12   5.54   0.008
Residual Error   35   2.31544E+13   6.61555E+11
Total            37   3.04889E+13

Source   DF   Seq SS
HS        1   7.33335E+12
Growth    1   1152089260

Unusual Observations
Obs   HS     Sales     Fit       SE Fit   Residual   St Resid
17    85.3   3860007   2116944   205088   1743063     2.21R
30    81.7   1695219   1936614   786380   -241395    -1.16 X

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
MTB > regress c1 2 c5 c4
Regression Analysis: Sales versus HS, Income
The regression equation is
Sales = - 3089379 + 62540 HS - 3.0 Income

Predictor   Coef       SE Coef   T       P
Constant    -3089379   1715239   -1.80   0.080
HS             62540     30098    2.08   0.045
Income         -3.01     25.26   -0.12   0.906  Insignificant.

S = 813216   R-Sq = 24.1%   R-Sq(adj) = 19.7%  This is the highest R-sq. But if Sales go up by 62540 for a rise of 1 in HS, the increase of 3.8% isn't much of a change from our previous result.

Analysis of Variance
Source           DF   SS            MS            F      P
Regression        2   7.34273E+12   3.67136E+12   5.55   0.008
Residual Error   35   2.31462E+13   6.61320E+11
Total            37   3.04889E+13

Source   DF   Seq SS
HS        1   7.33335E+12
Income    1   9375516635

Unusual Observations
Obs   HS     Sales     Fit       SE Fit   Residual   St Resid
17    85.3   3860007   2096954   272313   1763053    2.30R

R denotes an observation with a large standardized residual
MTB > regress c1 2 c5 c6
Regression Analysis: Sales versus HS, College
The regression equation is
Sales = - 3193739 + 64448 HS - 6161 College

Predictor   Coef       SE Coef   T       P
Constant    -3193739   1627759   -1.96   0.058
HS             64448     25486    2.53   0.016
College        -6161     23343   -0.26   0.793  Insignificant.

S = 812572   R-Sq = 24.2%   R-Sq(adj) = 19.9%

Analysis of Variance
Source           DF   SS            MS            F      P
Regression        2   7.37935E+12   3.68967E+12   5.59   0.008
Residual Error   35   2.31096E+13   6.60273E+11
Total            37   3.04889E+13

Source    DF   Seq SS
HS         1   7.33335E+12
College    1   45998746314

Unusual Observations
Obs   HS     Sales     Fit       SE Fit   Residual   St Resid
17    85.3   3860007   2113692   196709   1746316    2.22R

R denotes an observation with a large standardized residual
3. In desperation the researcher tries to add all the variables at once.
a. What does the ANOVA show? (2)
b. Do any of the coefficients have a wrong sign? (Remember there is nothing wrong with a negative coefficient unless you can give a reason why it shouldn't be negative.) (1)
c. Which of the coefficients are significant? (2) Lots of you answered this, few of you gave reasons, and many of you seem to have thought that every positive coefficient was significant.
d. Do an F test to show if addition of all the variables improved the regression. To do this drop a few zeros. Take the regression sum of squares in the regression with 'HS' alone as 7.333, the regression sum of squares after adding all the new variables as 7.454 and the error sum of squares as 23.303. (This should have read 23.034, but it wouldn't change anything.) I'm getting this from the ANOVA table below and the sequential SS table below it by dividing all the SS's by $10^{12}$, since only their relative size matters. (3)
e. To put the results in perspective, try again to predict the sales that a store with the mean values of the independent variables would have, and what percent increase in sales would come from an increase of 1 in 'HS.' How does this compare with our prediction when we used 'HS' alone?
f. The column marked VIF (variance inflation factor) is a test for (multi)collinearity. The rule of thumb is that if any of these exceeds 5, we have a multicollinearity problem. None does. What is multicollinearity and why am I worried about it? (2)
Solution: a) The p-value for the ANOVA is .095, indicating that we can reject the null hypothesis of no explanatory power for the regression at 10%, but not at 5%.
b) The negative coefficient on College doesn't look very reasonable; do we really think that being a college graduate cuts your demand for athletic equipment? It seems that many of you thought that the coefficient of Age could not be negative because age can't be negative. If you had been thinking, you would have realized that areas with older residents might buy less sports equipment.
c) HS is significant at the 10% level because its p-value is below 10%.
d)
Source    SS       DF   MS        F
HS         7.333    1   7.333    11.78
Others     0.121    4   0.03025   0.05
Error     23.034   32   0.6225
Total     30.488   37
If you check the F table for 1 and 32 DF, the 5% value is 4.15 and the 1% value is 7.50, so we would have to accept the null hypothesis of no improvement.
e) The regression says
Sales = - 2270706 + 62735 HS - 27384 Age - 5702 College + 2.4 Income + 2084 Growth
= -2270706 + 62735(77.24) - 27384(32.450) - 5702(23.67) + 2.4(34175) + 2084(1.599)
= -2270706 + 4845651 - 888611 - 134966 + 82020 + 3332 = 1636720. Still about 3.8%.
f) Multicollinearity is close correlation between the independent variables, and it makes accurate values of the coefficients hard to get.
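To make part d) concrete, here is a short Python sketch (an addition, not part of the original exam) of the F test for the improvement from the added variables, using the rounded sums of squares quoted above. It uses the error degrees of freedom of 32, so the F value differs slightly from the table above, but the conclusion is the same.

```python
# Partial F test: did adding Age, College, Income and Growth help?
# Sums of squares are in units of 1e12, as in the solution above.
ssr_hs  = 7.333      # regression SS with HS alone
ssr_all = 7.454      # regression SS with all five predictors
sse_all = 23.034     # error SS of the full regression
df_new, df_err = 4, 32

f_stat = ((ssr_all - ssr_hs) / df_new) / (sse_all / df_err)
print(f"F({df_new},{df_err}) = {f_stat:.3f}")  # about 0.04 -- nowhere near significant
```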
MTB > regress c1 5 c5 c2 c6 c4 c3;
SUBC> vif.
Regression Analysis: Sales versus HS, Age, College, Income, Growth
The regression equation is
Sales = - 2270706 + 62735 HS - 27384 Age - 5702 College + 2.4 Income + 2084 Growth

Predictor   Coef       SE Coef   T       P       VIF
Constant    -2270706   3696533   -0.61   0.543
HS             62735     35090    1.79   0.083   3.5
Age           -27384     93046   -0.29   0.770   1.3
College        -5702     28359   -0.20   0.842   2.7
Income          2.45     30.53    0.08   0.937   3.8
Growth          2084     44098    0.05   0.963   1.4

S = 848433   R-Sq = 24.4%   R-Sq(adj) = 12.6%

Analysis of Variance
Source           DF   SS            MS            F      P
Regression        5   7.45407E+12   1.49081E+12   2.07   0.095
Residual Error   32   2.30348E+13   7.19839E+11
Total            37   3.04889E+13

Source    DF   Seq SS
HS         1   7.33335E+12
Age        1   89382656452
College    1   26200610077
Income     1   3524785887
Growth     1   1608397623

Unusual Observations
Obs   HS     Sales     Fit       SE Fit   Residual   St Resid
17    85.3   3860007   2038662   360453   1821346     2.37R
30    81.7   1695219   1899886   826437   -204667    -1.07 X

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
II. Do at least 4 of the following 7 problems (at least 15 points each), or do sections adding to at least 60 points. Anything extra you do helps, and grades wrap around. Show your work! State $H_0$ and $H_1$ where applicable. Use a significance level of 5% unless noted otherwise. Do not answer questions without citing appropriate statistical tests. Remember: data must be in order for Lilliefors, and make sure that you do not cross up x and y in regressions.
1. (Berenson et. al. 1220) A firm believes that less than 15% of people remember their ads. A survey is taken to see what recall occurs, with the following results (in these problems calculating proportions won't help you unless you do a statistical test):

Medium        Mag   TV    Radio   Total
Remembered     25    10       7      42
Forgot         73    93     108     274
Total          98   103     115     316

a. Test the hypothesis that the recall rate is less than 15% by using proportions calculated from the 'Total' column. Find a p-value for this result. (5)
b. Test the hypothesis that the proportion recalling was lower for Radio than TV. (4)
c. Test to see if there is a significant difference in the proportion that remembered according to the medium. (6)
d. The Marascuilo procedure says that if (i) equality is rejected in c) and (ii) $\left|p_2 - p_3\right| \ge \sqrt{\chi^2}\,s_{\Delta p}$, where the chi-squared is what you used in c) and the standard deviation is what you would use in a confidence interval solution to b), you can say that you have a significant difference between TV and Radio. Try it! (5)
Solution: I have never seen so many people lose their common sense as on this problem. Many of you seemed to think that the answer to c) was the answer to a), in spite of the fact that .15 appeared nowhere in your answer. An A student tried at one point to compare the total fraction that forgot with the total fraction that remembered, even though the method she used was intended to compare fractions of two different groups, and the two fractions she compared could only have been the same if they were both .5, since they had to add to one.
a) From the formula table:

Interval for: Proportion
Confidence interval: $p = \bar p \pm z_{\alpha/2} s_{\bar p}$, where $s_{\bar p} = \sqrt{\frac{\bar p\bar q}{n}}$ and $\bar q = 1 - \bar p$
Hypotheses: $H_0: p = p_0$, $H_1: p \ne p_0$
Test ratio: $z = \frac{\bar p - p_0}{\sigma_{\bar p}}$, where $\sigma_{\bar p} = \sqrt{\frac{p_0 q_0}{n}}$ and $q_0 = 1 - p_0$
Critical value: $p_{cv} = p_0 \pm z_{\alpha/2}\sigma_{\bar p}$

$H_1: p < .15$. It is an alternate hypothesis because it does not contain an equality. The null hypothesis is thus $H_0: p \ge .15$. Initially, assume $\alpha = .05$ and note that $n = 316$ and $x = 42$, so that $\bar p = \frac{x}{n} = \frac{42}{316} = .1329$ and $\sigma_{\bar p} = \sqrt{\frac{p_0 q_0}{n}} = \sqrt{\frac{.15(.85)}{316}} = \sqrt{.0004035} = .0201$. This is a one-sided test and $z_\alpha = z_{.05} = 1.645$. This problem can be done in one of three ways.

(i) The test ratio is $z = \frac{\bar p - p_0}{\sigma_{\bar p}} = \frac{.1329 - .15}{.0201} = -0.85$. Make a diagram of a Normal curve with a mean at zero and a 'reject' zone below $-z_\alpha = -z_{.05} = -1.645$. Since $z = -0.85$ is not in the 'reject' zone, do not reject $H_0$. We cannot say that the proportion who recall is significantly below 15%. We can use this to get a p-value. Since our alternate hypothesis is $H_1: p < .15$, we want a down-side value, i.e. $P(\bar p \le .1329) = P\left(z \le \frac{.1329 - .15}{.0201}\right) = P(z \le -0.85) = P(z \le 0) - P(-0.85 \le z \le 0) = .5 - .3023 = .1977$. Since the p-value is above the significance level, do not reject $H_0$. Make a diagram. Draw a Normal curve with a mean at .15 and represent the p-value by the area below .1329, or draw a Normal curve with a mean at zero and represent the p-value by the area below -0.85.

(ii) Since the alternative hypothesis says $H_1: p < .15$, we need a critical value that is below .15. We use $p_{cv} = p_0 - z_\alpha\sigma_{\bar p} = .15 - 1.645(.0201) = .1170$. Make a diagram of a Normal curve with a mean at .15 and a 'reject' zone below .1170. Since $\bar p = .1329$ is not in the 'reject' zone, do not reject $H_0$. We cannot say that the proportion is significantly below 15%.

(iii) To do a confidence interval we need $s_{\bar p} = \sqrt{\frac{\bar p\bar q}{n}} = \sqrt{\frac{.1329(.8671)}{316}} = \sqrt{.0003647} = .0191$. To make the 2-sided confidence interval, $p = \bar p \pm z_{\alpha/2} s_{\bar p}$, into a 1-sided interval, go in the same direction as $H_1: p < .15$. Thus the interval is $p \le \bar p + z_\alpha s_{\bar p} = .1329 + 1.645(.0191) = .1643$. $p \le .1643$ does not contradict the null hypothesis.
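For readers who want to check part a) by machine, here is a minimal Python sketch of the test-ratio method (an addition to this key; it assumes SciPy is available for the Normal CDF):

```python
from math import sqrt
from scipy import stats

# One-sample, lower-tail test of H0: p >= .15 (part a) above)
n, x, p0 = 316, 42, 0.15
p_bar = x / n
sigma = sqrt(p0 * (1 - p0) / n)    # standard error under H0
z = (p_bar - p0) / sigma
p_value = stats.norm.cdf(z)        # down-side p-value
print(z, p_value)                  # roughly -0.85 and .20
```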
b) We are comparing $p_3 = \frac{7}{115} = .06087$, $n_3 = 115$, and $p_2 = \frac{10}{103} = .09709$, $n_2 = 103$. From the formula table:

Interval for: Difference between proportions ($q = 1 - p$)
Confidence interval: $\Delta p = \Delta\bar p \pm z_{\alpha/2} s_{\Delta\bar p}$, where $\Delta\bar p = \bar p_1 - \bar p_2$ and $s_{\Delta\bar p} = \sqrt{\frac{\bar p_1\bar q_1}{n_1} + \frac{\bar p_2\bar q_2}{n_2}}$
Hypotheses: $H_0: \Delta p = \Delta p_0$, $H_1: \Delta p \ne \Delta p_0$, where $\Delta p = p_1 - p_2$ and $\Delta p_0 = p_{01} - p_{02}$
Test ratio: $z = \frac{\Delta\bar p - \Delta p_0}{\sigma_{\Delta\bar p}}$. If $\Delta p_0 = 0$, $\sigma_{\Delta\bar p} = \sqrt{\bar p_0\bar q_0\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$ where $\bar p_0 = \frac{n_1\bar p_1 + n_2\bar p_2}{n_1 + n_2}$; if $\Delta p_0 \ne 0$, $\sigma_{\Delta\bar p} = \sqrt{\frac{p_{01}q_{01}}{n_1} + \frac{p_{02}q_{02}}{n_2}}$, or use $s_{\Delta\bar p}$
Critical value: $\Delta p_{cv} = \Delta p_0 \pm z_{\alpha/2}\sigma_{\Delta\bar p}$

$s_{\Delta\bar p} = \sqrt{\frac{p_3 q_3}{n_3} + \frac{p_2 q_2}{n_2}} = \sqrt{\frac{.06087(.93913)}{115} + \frac{.09709(.90291)}{103}} = \sqrt{.00049709 + .00085108} = \sqrt{.00134817} = .03672$

$\Delta\bar p = p_3 - p_2 = -.03622$, and $\bar p_0 = \frac{n_3 p_3 + n_2 p_2}{n_3 + n_2} = \frac{115(.06087) + 103(.09709)}{115 + 103} = .07798$. (Yes, $\bar p_0 = \frac{7 + 10}{115 + 103} = .07798$, but why waste your time?) $\alpha = .05$ and $z_\alpha = z_{.05} = 1.645$. Note that $q = 1 - p$ and that $q$ and $p$ are between 0 and 1.

$\sigma_{\Delta\bar p} = \sqrt{\bar p_0\bar q_0\left(\frac{1}{n_3} + \frac{1}{n_2}\right)} = \sqrt{.07798(.92202)\left(\frac{1}{115} + \frac{1}{103}\right)} = \sqrt{.0013233} = .03638$

Our hypotheses are $H_0: p_3 \ge p_2$ and $H_1: p_3 < p_2$, or $H_0: p_3 - p_2 \ge 0$ and $H_1: p_3 - p_2 < 0$, or $H_0: \Delta p \ge 0$ and $H_1: \Delta p < 0$. There are three ways to do this problem. Only one is needed.

(i) Test ratio: $z = \frac{\Delta\bar p - \Delta p_0}{\sigma_{\Delta\bar p}} = \frac{-.03622 - 0}{.03638} = -0.9956$. Make a diagram showing a 'reject' region below -1.645. Since -0.9956 is above this value, do not reject $H_0$.

(ii) Critical value: $p_{cv} = p_0 - z_\alpha\sigma_{\bar p}$ becomes $\Delta p_{cv} = \Delta p_0 - z_\alpha\sigma_{\Delta\bar p} = 0 - 1.645(.03638) = -.05985$. Make a diagram showing a 'reject' region below -.05985. Since $\Delta\bar p = -.03622$ is not below this value, do not reject $H_0$.

(iii) Confidence interval: $p = \bar p + z_\alpha s_{\bar p}$ becomes $\Delta p \le \Delta\bar p + z_\alpha s_{\Delta\bar p} = -.03622 + 1.645(.03672) = .02418$. Since $\Delta p \le .02418$ does not contradict $\Delta p \ge 0$, do not reject $H_0$.
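A similar added sketch for part b) (again assuming SciPy) pools the two proportions exactly as in method (i):

```python
from math import sqrt
from scipy import stats

# Two-proportion, lower-tail test of H0: p3 >= p2 (part b) above)
x3, n3, x2, n2 = 7, 115, 10, 103
p3, p2 = x3 / n3, x2 / n2
p_pool = (x3 + x2) / (n3 + n2)                        # pooled proportion
sigma = sqrt(p_pool * (1 - p_pool) * (1/n3 + 1/n2))   # SE under H0
z = (p3 - p2) / sigma
print(z, stats.norm.cdf(z))                           # about -1.00 and .16
```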
c) $DF = (r-1)(c-1) = (1)(2) = 2$ and $\chi^2_{.05}(2) = 5.9915$.
$H_0$: Homogeneous (or $p_1 = p_2 = p_3$)   $H_1$: Not homogeneous (not all the ps are equal)

The observed table $O$, with row proportions $p_r$:

             Mag    TV    Radio   Total    p_r
Remembered    25    10       7      42    .13291
Forgot        73    93     108     274    .86709
Total         98   103     115     316   1.0000

The expected table $E$:

             Mag       TV        Radio      Total      p_r
Remembered   13.0252   13.6897   15.2846    42.000    .13291
Forgot       84.9748   89.3103   99.7154   274.000    .86709
Total        98.0000  103.000   115.000    316.000   1.00000

The proportions in rows, $p_r$, are used with column totals to get the items in $E$. Note that row and column sums in $E$ are the same as in $O$. (Note that $\chi^2 = 19.0224 = 335.0229 - 316$ is computed two different ways here - only one way is needed. Too many of you wasted your time computing both of the last two columns; with $E$ so large, you only needed the $O$ and $E$ columns and the last column.)

Row    O      E         E-O        (E-O)²    (O-E)²/E   O²/E
1      25    13.0252   -11.9748   143.396    11.0092     47.984
2      73    84.9748    11.9748   143.396     1.6875     62.713
3      10    13.6897     3.6897    13.614     0.9945      7.305
4      93    89.3103    -3.6897    13.614     0.1524     96.842
5       7    15.2846     8.2846    68.635     4.4905      3.206
6     108    99.7154    -8.2846    68.635     0.6883    116.973
      316   316.000      0.0000              19.0224    335.022

Since the $\chi^2$ computed here is greater than the $\chi^2$ from the table, we reject $H_0$.
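The whole part c) computation can be checked with SciPy's contingency-table routine; this is an added illustration, not part of the original solution.

```python
import numpy as np
from scipy import stats

# Chi-squared test of homogeneity for the Remembered/Forgot table above
observed = np.array([[25, 10,   7],
                     [73, 93, 108]])
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(chi2, dof, p)   # about 19.02 with 2 DF, p well below .05
```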
d) The Marascuilo procedure says that if (i) equality is rejected in c) and (ii) $\left|p_2 - p_3\right| \ge \sqrt{\chi^2}\,s_{\Delta p}$, where the chi-squared is what you used in c) and the standard deviation is what you would use in a confidence interval solution to b), you can say that you have a significant difference between TV and Radio.
OK - we already have $DF = (r-1)(c-1) = (1)(2) = 2$ and $\chi^2_{.05}(2) = 5.9915$.
$s_{\Delta\bar p} = \sqrt{\frac{p_3 q_3}{n_3} + \frac{p_2 q_2}{n_2}} = \sqrt{\frac{.06087(.93913)}{115} + \frac{.09709(.90291)}{103}} = \sqrt{.00049709 + .00085108} = \sqrt{.00134817} = .03672$
$\Delta\bar p = p_3 - p_2 = -.03622$. I guess we really should use $\chi^2_{.025}(2) = 7.3778$, so that $\sqrt{7.3778} = 2.7162$ and $\sqrt{\chi^2_{.025}(2)}\,s_{\Delta\bar p} = 2.7162(.03672) = .09974$. Since $\left|p_2 - p_3\right|$ is obviously smaller than this, we do not have a significant difference in these two proportions.
2. (Berenson et. al. 1142) A manager is inspecting a new type of battery. These are subjected to 4 different pressure levels and their time to failure is recorded. The manager knows from experience that such data is not normally distributed. Ranks are provided.

PRESSURE
Use    low    rank   normal   rank   high   rank   whee!   rank
1       8.0    11      7.6      8     6.0     4      5.1     1
2       8.1    12      8.2     13     6.3     5      5.6     2
3       9.2    15      9.8     17     7.1     7      5.9     3
4       9.4    16     10.9     18     7.7     9      6.7     6
5      11.7    19     12.3     20     8.9    14      7.8    10

a. At the 5% level analyze the data on the assumption that each column represents a random sample. Do the column medians differ? (5)
b. Rerank the data appropriately and repeat a) on the assumption that the data is non-normal but cross classified by use. (5)
c. This time I want to compare high pressure (H) against low-moderate pressure (L). I will write out the numbers 1-20 and label them according to pressure.

1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20
H  H  H  H  H  H  H  L  H  H   L   L   L   H   L   L   L   L   L   L

Do a runs test to see if the H's and L's appear randomly. This is called a Wald-Wolfowitz test for the equality of means in two nonnormal samples. The null hypothesis is that the sequence is random and the means are equal. What is your conclusion? (5)
Solution: a) This is a Kruskal-Wallis test, equivalent to one-way ANOVA when the underlying distribution is non-normal.
$H_0$: Columns come from the same distribution or medians equal. I am basically copying the outline. There are $n = 20$ data items, so rank them from 1 to 20. Let $n_i$ be the number of items in column $i$ and $SR_i$ be the rank sum of column $i$, with $n = \sum n_i$.

Use    low    rank    normal   rank    high   rank    whee!   rank
1       8.0    11       7.6      8      6.0     4       5.1     1
2       8.1    12       8.2     13      6.3     5       5.6     2
3       9.2    15       9.8     17      7.1     7       5.9     3
4       9.4    16      10.9     18      7.7     9       6.7     6
5      11.7    19      12.3     20      8.9    14       7.8    10
            SR1 = 73         SR2 = 76        SR3 = 39        SR4 = 22

To check the ranking, note that the sum of the four rank sums is 73 + 76 + 39 + 22 = 210, and that the sum of the first $n$ numbers is $\frac{n(n+1)}{2} = \frac{20(21)}{2} = 210$.

Now, compute the Kruskal-Wallis statistic
$H = \left[\frac{12}{n(n+1)}\sum_i \frac{SR_i^2}{n_i}\right] - 3(n+1) = \left[\frac{12}{20(21)}\left(\frac{73^2}{5} + \frac{76^2}{5} + \frac{39^2}{5} + \frac{22^2}{5}\right)\right] - 3(21) = \frac{12}{420}\left(\frac{5329 + 5776 + 1521 + 484}{5}\right) - 63 = 11.9143$.

If the size of the problem is larger than those shown in Table 9, use the $\chi^2$ distribution, with $df = m - 1 = 3$, where $m$ is the number of columns. Compare $H$ with $\chi^2_{.05}(3) = 7.8147$. Since $H$ is larger than $\chi^2_{.05}$, reject the null hypothesis.
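As an added check (assuming SciPy is available), the built-in Kruskal-Wallis routine reproduces $H$:

```python
from scipy import stats

# Kruskal-Wallis test on the four pressure columns (part a) above)
low    = [8.0, 8.1, 9.2, 9.4, 11.7]
normal = [7.6, 8.2, 9.8, 10.9, 12.3]
high   = [6.0, 6.3, 7.1, 7.7, 8.9]
whee   = [5.1, 5.6, 5.9, 6.7, 7.8]
h, p = stats.kruskal(low, normal, high, whee)
print(h, p)   # H about 11.91, p below .05, so reject equal medians
```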
b) This is a Friedman test, equivalent to two-way ANOVA with one observation per cell when the underlying distribution is non-normal.
$H_0$: Columns come from the same distribution or medians equal. Note that the only difference between this and the Kruskal-Wallis test is that the data is cross-classified in the Friedman test.

Use    low    rank    normal   rank    high   rank    whee!   rank
1       8.0     4       7.6      3      6.0     2       5.1     1
2       8.1     3       8.2      4      6.3     2       5.6     1
3       9.2     3       9.8      4      7.1     2       5.9     1
4       9.4     3      10.9      4      7.7     2       6.7     1
5      11.7     3      12.3      4      8.9     2       7.8     1
            SR1 = 16         SR2 = 19        SR3 = 10        SR4 = 5

Assume that $\alpha = .05$. In the data, the pressures are represented by $c = 4$ columns, and the uses by $r = 5$ rows. In each row the numbers are ranked from 1 to $c = 4$. For each column, compute $SR_i$, the rank sum of column $i$.
To check the ranking, note that the sum of the four rank sums is 16 + 19 + 10 + 5 = 50, and that the sum of the $c$ numbers in a row is $\frac{c(c+1)}{2}$. However, there are $r$ rows, so we must multiply the expression by $r$. So we have $\sum SR_i = \frac{rc(c+1)}{2} = \frac{5(4)(5)}{2} = 50$.

Now compute the Friedman statistic
$\chi_F^2 = \left[\frac{12}{rc(c+1)}\sum_i SR_i^2\right] - 3r(c+1) = \left[\frac{12}{5(4)(5)}\left(16^2 + 19^2 + 10^2 + 5^2\right)\right] - 3(5)(5) = \frac{12}{100}(256 + 361 + 100 + 25) - 75 = 14.04$.

Since the size of the problem is larger than those shown in Table 10, use the $\chi^2$ distribution, with $df = c - 1 = 3$, where $c$ is the number of columns. Again, $\chi^2_{.05}(3) = 7.8147$. Since our statistic is larger than $\chi^2_{.05}$, reject the null hypothesis.
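Again as an added check, SciPy's Friedman routine ranks within each 'Use' row just as above:

```python
from scipy import stats

# Friedman test: same data, but ranked within each row (part b) above)
low    = [8.0, 8.1, 9.2, 9.4, 11.7]
normal = [7.6, 8.2, 9.8, 10.9, 12.3]
high   = [6.0, 6.3, 7.1, 7.7, 8.9]
whee   = [5.1, 5.6, 5.9, 6.7, 7.8]
chi2, p = stats.friedmanchisquare(low, normal, high, whee)
print(chi2, p)   # about 14.04, p below .05, so reject equal medians
```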
c) This time I want to compare high pressure (H) against low-moderate pressure (L). I will write out the numbers 1-20 and label them according to pressure.

1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20
H  H  H  H  H  H  H  L  H  H   L   L   L   H   L   L   L   L   L   L

If we do a runs test, $n = 20$, $r = 6$, $n_1 = 10$ and $n_2 = 10$. Can you see why $r = 6$? The runs are HHHHHHH | L | HH | LLL | H | LLLLLL. The runs test table gives us critical values of 6 and 16. The directions on the table say to reject randomness if $r \le 6$ or $r \ge 16$. Since the sequence is not random, the means are not equal.
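Counting the runs can be automated; a tiny added sketch:

```python
# Wald-Wolfowitz runs test for the H/L sequence above: count the runs
seq = list("HHHHHHHLHHLLLHLLLLLL")
runs = 1 + sum(a != b for a, b in zip(seq, seq[1:]))
print(runs)   # 6 -- at the lower critical value of 6, so reject randomness
```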
3. A researcher studies the relationship of numbers of subsidiaries and numbers of parent companies in 11 metropolitan areas and finds the following:

Area    x (parents)   y (subsidiaries)    x²        xy         y²
1          653            2607           426409    1702371    6796449
2          391            1714           152881     670174    2937796
3          352            1857           123904     653664    3448449
4          261            1228            68121     320508    1507984
5          226             880            51076     198880     774400
6          218             671            47524     146278     450241
7          202            1524            40804     307848    2322576
8          151             889            22801     134239     790321
9          141             482            19881      67962     232324
10         138             569            19044      78522     323761
11         134             662            17956      88708     438244
Total     2867           13083           990401    4369154   20022545
a. Do Spearman’s rank correlation between x and y and test it for significance (6)
b. Compute the sample correlation between x and y and test it for significance (6)
c. Compute the sample standard deviation of x and test to see if it equals 200 (4)
Solution: a) First rank x and y into $r_x$ and $r_y$ respectively. (As usual, people tried to compute rank correlations without ranking. I warned you!) Compute $d = r_x - r_y$ and $d^2$.

Area    x      y      r_x   r_y   d = r_x - r_y   d²
1      653   2607     11    11         0           0
2      391   1714     10     9         1           1
3      352   1857      9    10        -1           1
4      261   1228      8     7         1           1
5      226    880      7     5         2           4
6      218    671      6     4         2           4
7      202   1524      5     8        -3           9
8      151    889      4     6        -2           4
9      141    482      3     1         2           4
10     138    569      2     2         0           0
11     134    662      1     3        -2           4
Total                 66    66         0          32

Note that $\sum d = 0$ is a check on the correctness of the ranking.
From the outline: $r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(32)}{11(121 - 1)} = 1 - \frac{192}{11(120)} = 1 - 0.1455 = 0.8545$

The n = 11 line from the rank correlation table has:

n      α = .050   α = .025   α = .010   α = .005
11      .5273      .6091      .7000      .7818

If you tested $H_0: \rho_s \le 0$ against $H_1: \rho_s > 0$ at the 5% level, reject the null hypothesis of no relationship if $r_s$ is above .5273; or, if you tested $H_0: \rho_s = 0$ against $H_1: \rho_s \ne 0$ at the 5% level, reject the null hypothesis if $r_s$ is above .6091. So, in this case we have a significant rank correlation.
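An added check with SciPy (note its p-value comes from an approximation rather than the exact table used above):

```python
from scipy import stats

# Spearman rank correlation for parents (x) vs subsidiaries (y) above
x = [653, 391, 352, 261, 226, 218, 202, 151, 141, 138, 134]
y = [2607, 1714, 1857, 1228, 880, 671, 1524, 889, 482, 569, 662]
rs, p = stats.spearmanr(x, y)
print(rs, p)   # rs about 0.85, clearly significant
```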
b) We need the usual spare parts. From above, $n = 11$, $\sum x = 2867$, $\sum y = 13083$, $\sum x^2 = 990401$, $\sum xy = 4369154$ and $\sum y^2 = 20022545$. (In spite of the fact that most column computations were done for you, many of you wasted time and energy doing them over again.)

Spare Parts Computation:
$\bar x = \frac{\sum x}{n} = \frac{2867}{11} = 260.636$   $\bar y = \frac{\sum y}{n} = \frac{13083}{11} = 1189.36$
$SS_x = \sum x^2 - n\bar x^2 = 990401 - 11(260.636)^2 = 243158.6$
$S_{xy} = \sum xy - n\bar x\bar y = 4369154 - 11(260.636)(1189.36) = 959263.6$
$SS_y = \sum y^2 - n\bar y^2 = 20022545 - 11(1189.36)^2 = 4462195.7$

The simple sample coefficient of determination is
$R^2 = \frac{\left(\sum xy - n\bar x\bar y\right)^2}{\left(\sum x^2 - n\bar x^2\right)\left(\sum y^2 - n\bar y^2\right)} = \frac{S_{xy}^2}{SS_x\,SS_y} = \frac{(959263.6)^2}{(243158.6)(4462195.7)} = .8481$

so $r = \sqrt{.8481} = .9209$. From the outline, if we want to test $H_0: \rho_{xy} = 0$ against $H_1: \rho_{xy} \ne 0$ and x and y are normally distributed, we use
$t^{(n-2)} = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{.9209}{\sqrt{\frac{1 - .8481}{9}}} = \frac{.9209}{\sqrt{.01688}} = \frac{.9209}{.12992} = 7.088$.

Since $t^{(9)}_{.025} = 2.262$, we reject $H_0$. Note that R-squared is always between 0 and 1 and that correlations are always between -1 and +1.
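An added sketch reproducing $r$ and the t ratio (assumes SciPy):

```python
from scipy import stats

# Pearson correlation and its t test (part b) above)
x = [653, 391, 352, 261, 226, 218, 202, 151, 141, 138, 134]
y = [2607, 1714, 1857, 1228, 880, 671, 1524, 889, 482, 569, 662]
r, p = stats.pearsonr(x, y)
t = r / ((1 - r**2) / (len(x) - 2)) ** 0.5   # same t as in the solution
print(r, t, p)   # about 0.921, 7.09, and a p-value far below .05
```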
c) From the formula table (but the outline is better):

Interval for: Variance (small sample)
Confidence interval: $\sigma^2 = \frac{(n-1)s^2}{\chi^2_{[\alpha/2,\;1-\alpha/2]}}$
Hypotheses: $H_0: \sigma^2 = \sigma_0^2$, $H_1: \sigma^2 \ne \sigma_0^2$
Test ratio: $\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$
Critical value: $s_{cv}^2 = \frac{\chi^2_{[\alpha/2,\;1-\alpha/2]}\,\sigma_0^2}{n-1}$

Interval for: Variance (large sample)
Confidence interval: $\sigma = \frac{s\sqrt{2(DF)}}{\sqrt{2(DF)} \pm z_{\alpha/2}}$
Test ratio: $z = \sqrt{2\chi^2} - \sqrt{2(DF) - 1}$

We already know $n = 11$ and $SS_x = \sum x^2 - n\bar x^2 = 990401 - 11(260.636)^2 = 243158.6$. Assume $\alpha = .05$.
$s_x^2 = \frac{SS_x}{n-1} = \frac{243158.6}{10} = 24315.86$, so $s = \sqrt{24315.86} = 155.90$.
Only the test ratio method is normally used. $H_0: \sigma = 200$, $H_1: \sigma \ne 200$; then $\chi^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{10(24315.86)}{200^2} = 6.0790$.
$DF = n - 1 = 10$. Make a diagram. Show a curve with a mean at 10 and rejection zones (shaded) above $\chi^2_{.025}(10) = 20.4832$ and below $\chi^2_{.975}(10) = 3.2470$. Since your value of $\chi^2$ is between the two critical values, do not reject $H_0$.
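An added sketch of the test-ratio method for the variance (assumes SciPy for the chi-squared percentiles):

```python
from scipy import stats

# Chi-squared test of H0: sigma = 200 with n = 11, s^2 = 24315.86 (part c) above)
n, s_sq, sigma0 = 11, 24315.86, 200
chi2 = (n - 1) * s_sq / sigma0**2
lo = stats.chi2.ppf(0.025, df=n - 1)   # about 3.247
hi = stats.chi2.ppf(0.975, df=n - 1)   # about 20.483
print(chi2, lo < chi2 < hi)            # 6.079, True -> do not reject H0
```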
4. Data from the previous page is repeated:

Area    x (parents)   y (subsidiaries)    x²        xy         y²
1          653            2607           426409    1702371    6796449
2          391            1714           152881     670174    2937796
3          352            1857           123904     653664    3448449
4          261            1228            68121     320508    1507984
5          226             880            51076     198880     774400
6          218             671            47524     146278     450241
7          202            1524            40804     307848    2322576
8          151             889            22801     134239     790321
9          141             482            19881      67962     232324
10         138             569            19044      78522     323761
11         134             662            17956      88708     438244
Total     2867           13083           990401    4369154   20022545
a. Test the hypothesis that the correlation between x and y is .8 (5)
b. Test the hypothesis that x has the Normal distribution. (9)
c. Test the hypothesis that x and y have equal variances. (4)

Solution: a) From the previous page we know $n = 11$ and $R^2 = \frac{S_{xy}^2}{SS_x\,SS_y} = \frac{(959263.6)^2}{(243158.6)(4462195.7)} = .8481$, so $r = \sqrt{.8481} = .9209$. From the Correlation section of the outline: if we are testing $H_0: \rho_{xy} = \rho_0$ against $H_1: \rho_{xy} \ne \rho_0$, and $\rho_0 \ne 0$, the test is quite different. We need to use Fisher's z-transformation. Let $\tilde z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$. This has an approximate mean of $\mu_z = \frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right)$ and a standard deviation of $s_z = \sqrt{\frac{1}{n-3}}$, so that $t^{(n-2)} = \frac{\tilde z - \mu_z}{s_z}$. (Note: to get $\ln$, the natural log, compute the log to the base 10 and divide by .434294482.)

Test $H_0: \rho_{xy} = 0.8$ against $H_1: \rho_{xy} \ne 0.8$ when $n = 11$, $r = .9209$ and $r^2 = .8481$ $(\alpha = .05)$.
$\tilde z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) = \frac{1}{2}\ln\left(\frac{1.9209}{0.0791}\right) = \frac{1}{2}\ln(24.28445) = \frac{1}{2}(3.18984) = 1.5949$
$\mu_z = \frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right) = \frac{1}{2}\ln\left(\frac{1.8}{0.2}\right) = \frac{1}{2}\ln(9.0000) = \frac{1}{2}(2.19722) = 1.09861$
$s_z = \sqrt{\frac{1}{n-3}} = \sqrt{\frac{1}{8}} = 0.35355$
Finally $t = \frac{\tilde z - \mu_z}{s_z} = \frac{1.5949 - 1.09861}{0.35355} = 1.404$. Compare this with $\pm t^{(n-2)}_{.025} = \pm t^{(9)}_{.025} = \pm 2.262$. Since 1.404 lies between these two values, do not reject the null hypothesis.
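NumPy's arctanh is exactly Fisher's z-transformation, so part a) can be checked with the added sketch below:

```python
import numpy as np

# Fisher z test of H0: rho = 0.8 with r = .9209, n = 11 (part a) above)
r, rho0, n = 0.9209, 0.8, 11
z_r  = np.arctanh(r)       # 0.5*ln((1+r)/(1-r)) = 1.5949
mu_z = np.arctanh(rho0)    # 1.0986
s_z  = 1 / np.sqrt(n - 3)
t = (z_r - mu_z) / s_z
print(t)                   # about 1.40, inside +/-2.262, so do not reject
```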
Note: to do the above with logarithms to the base 10, try $\tilde z_{10} = \frac{1}{2}\log\left(\frac{1+r}{1-r}\right)$. This has an approximate mean of $\mu_{z10} = \frac{1}{2}\log\left(\frac{1+\rho_0}{1-\rho_0}\right)$ and a standard deviation of $s_{z10} = \sqrt{\frac{0.18861}{n-3}}$, so that $t^{(n-2)} = \frac{\tilde z_{10} - \mu_{z10}}{s_{z10}}$.
b) From the previous page we know $n = 11$, $\bar x = \frac{\sum x}{n} = \frac{2867}{11} = 260.636$, $SS_x = \sum x^2 - n\bar x^2 = 990401 - 11(260.636)^2 = 243158.6$, $s_x^2 = \frac{SS_x}{n-1} = \frac{243158.6}{10} = 24315.86$ and $s = \sqrt{24315.86} = 155.90$.

Use the setup in Problem E9 or E10. The best method to use here is Lilliefors because the data is not stated by intervals, the distribution for which we are testing is Normal, and the parameters of the distribution are unknown. We begin by putting the data in order and computing $z = \frac{x - \bar x}{s}$ (actually $t$), and proceed as in the Kolmogorov-Smirnov method. For example, in the second row $z = \frac{138 - 260.636}{155.90} = -0.79$; $O = 1$ because there is only one number in each interval; $n = 11$, so each value of $\frac{O}{n}$ is $\frac{1}{11} = .0909$. Since the highest number in the interval represented by Row 2 is at $z = -0.79$, $F_e = F(-0.79) = P(z \le -0.79) = P(z \le 0) - P(-0.79 \le z \le 0) = .5 - .2852 = .2146$.

Row    x     z       O    O/n     F_o      F_e                    D = |F_e - F_o|
1     134   -0.81    1   .0909   .0909    .5 - .2910 = .2090          .1181
2     138   -0.79    1   .0909   .1818    .5 - .2852 = .2146          .0328
3     141   -0.77    1   .0909   .2727    .5 - .2794 = .2206          .0521
4     151   -0.70    1   .0909   .3636    .5 - .2580 = .2420          .1216
5     202   -0.38    1   .0909   .4545    .5 - .1480 = .3520          .1025
6     218   -0.27    1   .0909   .5455    .5 - .1064 = .3936          .1519
7     226   -0.22    1   .0909   .6364    .5 - .0871 = .4129          .2235
8     261    0.00    1   .0909   .7273    .5000                       .2273
9     352    0.59    1   .0909   .8182    .5 + .2224 = .7224          .0958
10    391    0.84    1   .0909   .9091    .5 + .2995 = .7995          .1096
11    653    2.52    1   .0909  1.0000    .5 + .4941 = .9941          .0059

From the Lilliefors table for $\alpha = .05$ and $n = 11$, the critical value is .249. Since the maximum deviation (.2273) is below the critical value, we do not reject $H_0$.
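The table above can be reproduced with a few added lines of Python. Note that this sketch mirrors the exam's one-sided step comparison (the gap at the top of each step of the empirical CDF), and the critical value still has to come from a Lilliefors table:

```python
import numpy as np
from scipy import stats

# Lilliefors-style check: max gap between the sample CDF and a Normal CDF
# with estimated mean and s; critical value .249 for n = 11, alpha = .05
x = np.sort([653, 391, 352, 261, 226, 218, 202, 151, 141, 138, 134])
n = len(x)
z = (x - x.mean()) / x.std(ddof=1)
fe = stats.norm.cdf(z)                 # fitted Normal CDF
fo = np.arange(1, n + 1) / n           # empirical CDF at each point
d_max = np.max(np.abs(fe - fo))
print(d_max, d_max < 0.249)            # about .227, True -> do not reject
```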
c) From the previous page, $SS_y = \sum y^2 - n\bar y^2 = 20022545 - 11(1189.36)^2 = 4462195.7$, so $s_y^2 = \frac{SS_y}{n-1} = \frac{4462195.7}{10} = 446219.57$, and $s_x^2 = \frac{SS_x}{n-1} = \frac{243158.6}{10} = 24315.86$.
If we follow 252meanx4 in the outline: our hypotheses are $H_0: \sigma_x^2 = \sigma_y^2$ and $H_1: \sigma_x^2 \ne \sigma_y^2$. $DF_x = n - 1 = 10$ and $DF_y = n - 1 = 10$. Since the table is set up for one-sided tests, if we wish to test $H_0: \sigma_x^2 = \sigma_y^2$, we must do two separate one-sided tests. First test $\frac{s_x^2}{s_y^2} = \frac{24315.86}{446219.57} = 0.0545$ against $F^{(DF_x,\,DF_y)}_{.025} = F^{(10,10)}_{.025} = 3.72$, and then test $\frac{s_y^2}{s_x^2} = \frac{1}{0.0545} = 18.351$ against $F^{(DF_y,\,DF_x)}_{.025} = F^{(10,10)}_{.025} = 3.72$. If either test is failed, we reject the null hypothesis. Since 18.351 is larger than this critical value, reject $H_0$.
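An added sketch of the two one-sided F tests (assumes SciPy):

```python
from scipy import stats

# Variance-ratio F test: H0 equal variances, two one-sided tests at .025 each
s2_x, s2_y, df = 24315.86, 446219.57, 10
f = max(s2_x / s2_y, s2_y / s2_x)      # put the larger variance on top
crit = stats.f.ppf(0.975, df, df)      # F_.025(10,10) = 3.72
print(f, f > crit)                     # 18.35, True -> reject equal variances
```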
5. Data from the previous page is repeated:

Area    x (parents)   y (subsidiaries)    x²        xy         y²
1          653            2607           426409    1702371    6796449
2          391            1714           152881     670174    2937796
3          352            1857           123904     653664    3448449
4          261            1228            68121     320508    1507984
5          226             880            51076     198880     774400
6          218             671            47524     146278     450241
7          202            1524            40804     307848    2322576
8          151             889            22801     134239     790321
9          141             482            19881      67962     232324
10         138             569            19044      78522     323761
11         134             662            17956      88708     438244
Total     2867           13083           990401    4369154   20022545
a. Compute a simple regression of subsidiaries against parents as the independent variable. (5)
b. Compute $s_e$. (3)
c. Predict how many subsidiaries will appear in a city with 50 parent corporations. (1)
d. Make your prediction in c) into a confidence interval. (3)
e. Compute $s_{b_0}$ and make it into a confidence interval for $\beta_0$. (3)
f. Do an ANOVA for this regression and explain what it says about $\beta_1$. (3)
Solution: We need the usual spare parts. From above, $n = 11$, $\sum x = 2867$, $\sum y = 13083$, $\sum x^2 = 990401$, $\sum xy = 4369154$ and $\sum y^2 = 20022545$.

Spare Parts Computation (repeated from the previous page):
$\bar x = \frac{2867}{11} = 260.636$   $\bar y = \frac{13083}{11} = 1189.36$
$SS_x = \sum x^2 - n\bar x^2 = 990401 - 11(260.636)^2 = 243158.6$
$S_{xy} = \sum xy - n\bar x\bar y = 4369154 - 11(260.636)(1189.36) = 959263.6$
$SS_y = \sum y^2 - n\bar y^2 = 20022545 - 11(1189.36)^2 = 4462195.7$

a) $b_1 = \frac{S_{xy}}{SS_x} = \frac{\sum xy - n\bar x\bar y}{\sum x^2 - n\bar x^2} = \frac{959263.6}{243158.6} = 3.9450$
$b_0 = \bar y - b_1\bar x = 1189.36 - 3.9450(260.636) = 161.15$
$\hat Y = b_0 + b_1 x$ becomes $\hat Y = 161.15 + 3.9450x$.

b) We already know that $SST = SS_y = 4462195.7$ and that $R^2 = \frac{S_{xy}^2}{SS_x\,SS_y} = .8481$.
$SSR = b_1 S_{xy} = 3.9450(959263.6) = 3784295$, or $SSR = R^2(SST) = .8480827(4462195.7) = 3784311$.
$SSE = SST - SSR = 4462195.7 - 3784311 = 677885$
$s_e^2 = \frac{SSE}{n-2} = \frac{677885}{9} = 75320.5$, or $s_e^2 = \frac{\left(\sum y^2 - n\bar y^2\right) - b_1^2\left(\sum x^2 - n\bar x^2\right)}{n-2} = \frac{4462195.7 - (3.9450)^2(243158.6)}{9} = \frac{677917}{9} = 75324.1$
So $s_e = \sqrt{75320.5} = 274.4$. ($s_e^2$ is always positive!)

c) If $\hat Y = 161.15 + 3.9450x$ and $x = 50$, the prediction is $\hat Y = 161.15 + 3.9450(50) = 358.4$.

d) From the outline, the confidence interval is $\mu_{Y_0} = \hat Y_0 \pm t\,s_{\hat Y}$, where $s_{\hat Y}^2 = s_e^2\left[\frac{1}{n} + \frac{(X_0 - \bar X)^2}{\sum X^2 - n\bar X^2}\right]$. In this formula, for some specific $X_0$, $\hat Y_0 = b_0 + b_1 X_0$. Here $X_0 = 50$, $\hat Y_0 = 358.4$, $\bar X = 260.636$ and $n = 11$. Then
$s_{\hat Y}^2 = 75320.5\left[\frac{1}{11} + \frac{(50 - 260.636)^2}{243158.6}\right] = 75320.5(.27337) = 20590.4$
and $s_{\hat Y} = \sqrt{20590.4} = 143.5$, so that, with $t^{(n-2)}_{.025} = t^{(9)}_{.025} = 2.262$, the confidence interval is $\mu_{Y_0} = \hat Y_0 \pm t\,s_{\hat Y} = 358.4 \pm 2.262(143.5) = 358 \pm 325$. This represents a confidence interval for the average value that Y will take when $x = 50$, and it is proportionally rather gigantic because we have picked a point fairly far from the data that was actually experienced.

e) The outline says $s_{b_0}^2 = s_e^2\left[\frac{1}{n} + \frac{\bar X^2}{\sum X^2 - n\bar X^2}\right] = 75320.5\left[\frac{1}{11} + \frac{(260.636)^2}{243158.6}\right] = 75320.5(.37025) = 27887$. So $s_{b_0} = \sqrt{27887} = 167.0$.
So the interval is $\beta_0 = b_0 \pm t_{\alpha/2}\,s_{b_0} = 161.15 \pm 2.262(167.0) = 161 \pm 378$. This indicates that the intercept is not significant.

f) We can do an ANOVA table as follows:

Source        SS        DF     MS        F
Regression    SSR        1     MSR       MSR/MSE
Error         SSE       n-2    MSE
Total         SST       n-1

Source        SS        DF     MS        F
Regression    3784311    1     3784311   50.24
Error          677885    9       75321
Total         4462196   10

Note that $F^{(1,9)}_{.05} = 5.12$, so we reject the null hypothesis that x and y are unrelated. This is the same as saying that $H_0: \beta_1 = 0$ is false.
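The added sketch below rebuilds parts a) through d) numerically and reproduces the 358 ± 325 interval:

```python
import numpy as np
from scipy import stats

# Confidence interval for the mean of Y at x0 = 50 (parts c) and d) above)
x = np.array([653, 391, 352, 261, 226, 218, 202, 151, 141, 138, 134])
y = np.array([2607, 1714, 1857, 1228, 880, 671, 1524, 889, 482, 569, 662])
n = len(x)
ssx = np.sum(x**2) - n * x.mean()**2
sxy = np.sum(x * y) - n * x.mean() * y.mean()
b1 = sxy / ssx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
se2 = np.sum(resid**2) / (n - 2)                  # error mean square
x0 = 50
y_hat = b0 + b1 * x0
s_yhat = np.sqrt(se2 * (1/n + (x0 - x.mean())**2 / ssx))
t = stats.t.ppf(0.975, df=n - 2)
print(y_hat, t * s_yhat)                          # about 358 +/- 325
```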
The Minitab output for this problem follows:

The regression equation is
subno = 161 + 3.94 parno

Predictor   Coef     SE Coef   T      P
Constant    161.2    167.0     0.97   0.360
parno       3.9450   0.5566    7.09   0.000

S = 274.4   R-Sq = 84.8%   R-Sq(adj) = 83.1%

Analysis of Variance
Source           DF   SS        MS        F       P
Regression        1   3784219   3784219   50.24   0.000
Residual Error    9    677882     75320
Total            10   4462101

Unusual Observations
Obs   parno   subno    Fit      SE Fit   Residual   St Resid
1     653     2607.0   2737.2   233.5    -130.2     -0.90 X
7     202     1524.0    958.0    89.0     566.0      2.18R

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
6. A chain has the following data on prices, promotion expenses and sales of one product. (You will need $\sum x_1 x_2$ - computation of this sum is in red.) It might help to divide $y$, $x_1 y$ and $x_2 y$ by 10 and $y^2$ by 100.

Store    y (sales)   x1 (price)   x2 (promotion)   x1²     x2²       y²           x1·y      x1·x2    x2·y
1        4141        59           200              3481    40000     17147881     244319    11800     828200
2        3754        59           400              3481    160000    14092516     221486    23600    1501600
3        5000        59           600              3481    360000    25000000     295000    35400    3000000
4        4011        59           600              3481    360000    16088121     236649    35400    2406600
5        3224        79           200              6241    40000     10394176     254696    15800     644800
6        2618        79           400              6241    160000     6853924     206822    31600    1047200
7        3746        79           600              6241    360000    14032516     295934    47400    2247600
8        3825        79           600              6241    360000    14630625     302175    47400    2295000
9        1096        99           200              9801    40000      1201216     108504    19800     219200
10       1882        99           400              9801    160000     3541924     186318    39600     752800
11       2159        99           400              9801    160000     4661281     213741    39600     863600
12       2927        99           600              9801    360000     8567329     289773    59400    1756200
Total    38383       948          5200             78092   2560000   136211509    2855417   406800   17562800
y  3198.58, x1  79.0000 and x 2  433.333.
a. Do a multiple regression of sales against x1 and x 2 . (10)
b. Compute R 2 and R 2 adjusted for degrees of freedom. Use a regression ANOVA to test the usefulness
of this regression. (6)
d. Use your regression to predict sales when price is 79 cents and promotion expenses are $200. (2)
e. Use the directions in the outline to make this estimate into a confidence interval and a prediction interval.
(4)
f. If the regression of Price alone had the following output:

The regression equation is
sales = 7564 - 55.3 price

Predictor   Coef     SE Coef   T       P
Constant    7564.3   863.6     8.76    0.000
price       -55.26   10.71     -5.16   0.000

S = 605.6   R-Sq = 72.7%   R-Sq(adj) = 70.0%

Analysis of Variance
Source           DF   SS         MS        F       P
Regression        1   9772621    9772621   26.65   0.000
Residual Error   10   3667664     366766
Total            11   13440285

do an F-test to see if adding $x_2$ helped. (4) The next page is blank - please show your work.
I suggested that we divide $y$ values by 10 and $y$ squared by 100 to make computations more tractable. No one did, so I have redone this on the following page. After a year of statistics, too many of you decided that a sum like $\sum x_1 x_2$ could be computed as $\left(\sum x_1\right)\left(\sum x_2\right)$ and did something similar for some of the other sums. Where have you been?
Solution: a) (With $y$ divided by 10) $n = 12$, $\sum y = 3838.3$, $\sum x_1 = 948$, $\sum x_2 = 5200$, $\sum x_1^2 = 78092$, $\sum x_2^2 = 2560000$, $\sum y^2 = 1362115.09$, $\sum x_1 y = 285541.7$, $\sum x_2 y = 1756280.0$, and you should have found $\sum x_1 x_2 = 406800$.

First, we compute $\bar y = \frac{\sum y}{n} = \frac{3838.3}{12} = 319.858$, $\bar x_1 = \frac{\sum x_1}{n} = \frac{948}{12} = 79.00$, and $\bar x_2 = \frac{\sum x_2}{n} = \frac{5200}{12} = 433.333$. Then, we compute our spare parts:

$SST = SS_y = \sum y^2 - n\bar y^2 = 1362115.09 - 12(319.858)^2 = 134405$*
$S_{x_1 y} = \sum x_1 y - n\bar x_1\bar y = 285541.7 - 12(79)(319.858) = -17684$
$S_{x_2 y} = \sum x_2 y - n\bar x_2\bar y = 1756280 - 12(433.333)(319.858) = 93020$
$SS_{x_1} = \sum x_1^2 - n\bar x_1^2 = 78092 - 12(79)^2 = 3200$*
$SS_{x_2} = \sum x_2^2 - n\bar x_2^2 = 2560000 - 12(433.333)^2 = 306670$*
$S_{x_1 x_2} = \sum x_1 x_2 - n\bar x_1\bar x_2 = 406800 - 12(79)(433.333) = -4000$
(* indicates quantities that must be positive.)

Then we substitute these numbers into the Simplified Normal Equations:
$\sum X_1 Y - n\bar X_1\bar Y = b_1\left(\sum X_1^2 - n\bar X_1^2\right) + b_2\left(\sum X_1 X_2 - n\bar X_1\bar X_2\right)$
$\sum X_2 Y - n\bar X_2\bar Y = b_1\left(\sum X_1 X_2 - n\bar X_1\bar X_2\right) + b_2\left(\sum X_2^2 - n\bar X_2^2\right)$

which are
$-17684 = 3200 b_1 - 4000 b_2$
$93020 = -4000 b_1 + 306670 b_2$

and solve them as two equations in two unknowns for $b_1$ and $b_2$. These are a fairly tough pair of equations to solve until we notice that, if we multiply 4000 by 0.8, we get 3200. Multiplying the second equation by 0.8, the equations become
$-17684.0 = 3200 b_1 - 4000 b_2$
$74416.0 = -3200 b_1 + 245336 b_2$

If we add these together, we get $56732 = 241336 b_2$. This means that $b_2 = \frac{56732}{241336} = 0.23507$. Now remember that $3200 b_1 = -17684 + 4000(0.23507) = -16743.72$, so $b_1 = \frac{-16743.72}{3200} = -5.2324$. Finally we get $b_0$ by solving $b_0 = \bar Y - b_1\bar X_1 - b_2\bar X_2 = 319.858 + 5.232(79.00) - 0.2351(433.333) = 319.858 + 413.328 - 101.877 = 631.31$. Thus our equation is $\hat Y = b_0 + b_1 X_1 + b_2 X_2 = 631.31 - 5.232 X_1 + 0.2351 X_2$.
a) (The way most of you did it) $n = 12$, $\sum y = 38383$, $\sum x_1 = 948$, $\sum x_2 = 5200$, $\sum x_1^2 = 78092$, $\sum x_2^2 = 2560000$, $\sum y^2 = 136211509$, $\sum x_1 y = 2855417$, $\sum x_2 y = 17562800$, and you should have found $\sum x_1 x_2 = 406800$.

First, we compute $\bar y = \frac{38383}{12} = 3198.58$, $\bar x_1 = \frac{948}{12} = 79.00$, and $\bar x_2 = \frac{5200}{12} = 433.333$. Then, we compute our spare parts:

$SST = SS_y = \sum y^2 - n\bar y^2 = 136211509 - 12(3198.58)^2 = 13440541$*
$S_{x_1 y} = \sum x_1 y - n\bar x_1\bar y = 2855417 - 12(79)(3198.58) = -176837$
$S_{x_2 y} = \sum x_2 y - n\bar x_2\bar y = 17562800 - 12(433.333)(3198.58) = 930197$
$SS_{x_1} = 78092 - 12(79)^2 = 3200$*
$SS_{x_2} = 2560000 - 12(433.333)^2 = 306670$*
$S_{x_1 x_2} = 406800 - 12(79)(433.333) = -4000$
(* indicates quantities that must be positive.)

Then we substitute these numbers into the Simplified Normal Equations, which are
$-176837 = 3200 b_1 - 4000 b_2$
$930197 = -4000 b_1 + 306670 b_2$

and solve them as two equations in two unknowns for $b_1$ and $b_2$. Multiplying the second equation by 0.8 as before, the equations become
$-176837.0 = 3200 b_1 - 4000 b_2$
$744157.6 = -3200 b_1 + 245336 b_2$

If we add these together, we get $567220.6 = 241336 b_2$. This means that $b_2 = \frac{567220.6}{241336} = 2.3503$. Now remember that $3200 b_1 = -176837 + 4000(2.3503) = -167435.8$, so $b_1 = \frac{-167435.8}{3200} = -52.324$. Finally we get $b_0$ by solving $b_0 = \bar Y - b_1\bar X_1 - b_2\bar X_2 = 3198.58 + 52.324(79.00) - 2.3503(433.333) = 3198.58 + 4133.596 - 1018.4625 = 6313.71$. Thus our equation is $\hat Y = b_0 + b_1 X_1 + b_2 X_2 = 6313.71 - 52.324 X_1 + 2.3503 X_2$.
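Rather than the multiply-by-0.8 trick, the Simplified Normal Equations can be solved directly; this added sketch uses NumPy's linear solver on the undivided-y version (small rounding differences from the hand computation are expected):

```python
import numpy as np

# Solving the Simplified Normal Equations of part a) numerically
A = np.array([[ 3200.0,  -4000.0],
              [-4000.0, 306670.0]])
rhs = np.array([-176837.0, 930197.0])
b1, b2 = np.linalg.solve(A, rhs)
b0 = 3198.58 - b1 * 79.00 - b2 * 433.333
print(b1, b2, b0)   # about -52.3, 2.35, 6314 -- matching Minitab below
```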
b) (The way I did it) $SSE = SST - SSR$, and so
$SSR = b_1 S_{x_1 y} + b_2 S_{x_2 y} = (-5.232)(-17684) + (0.2351)(93020) = 92523 + 21869 = 114392$*
$SST = SS_y = 134405$, so $SSE = 134405 - 114392 = 20013$*
$R^2 = \frac{SSR}{SST} = \frac{114392}{134405} = 0.851$. If we use $\bar R^2$, which is $R^2$ adjusted for degrees of freedom, $\bar R^2 = \frac{(n-1)R^2 - k}{n-k-1} = \frac{11(0.851) - 2}{9} = .818$.

The ANOVA reads:

Source           DF   SS       MS       F       F.05
Regression        2   114392   57186    25.72   4.26
Residual Error    9    20013   2223.7
Total            11   134405

Since our computed F is larger than the table F, we reject the hypothesis that the Xs and Y are unrelated.

b) (The way most of you did it) $SSE = SST - SSR$, and so
$SSR = b_1 S_{x_1 y} + b_2 S_{x_2 y} = (-52.324)(-176837) + (2.3503)(930197) = 9252819 + 2186242 = 11439061$*
$SST = SS_y = 13440541$, so $SSE = 13440541 - 11439061 = 2001480$
$R^2 = \frac{SSR}{SST} = \frac{11439061}{13440541} = 0.850$. If we use $\bar R^2$, $\bar R^2 = \frac{(n-1)R^2 - k}{n-k-1} = \frac{11(0.850) - 2}{9} = .817$.

The ANOVA reads:

Source           DF   SS         MS         F       F.05
Regression        2   11439061   5719530    25.72   4.26
Residual Error    9    2001480   222386.7
Total            11   13440541

Since our computed F is larger than the table F, we reject the hypothesis that the Xs and Y are unrelated.
c) X 1  79, X 2  200 (My way) Yˆ  631.31 5.23279  0.2351200  631.31 413.33  47.02  265.00
(Your way) Yˆ  6313.71 52.32479  2.3503200  6313.71 4133.60  470.06  2650.17
d) We need to find s e . The best way to do this is to do an ANOVA or remember that s e2 
SSE
, and
n  k 1
that you got SSE  13440541  11439061  2001480
SSE 2001480
9
 2.262 . The outline says that an approximate
s e2 

 222386 .67 , so s e  471 .6 . t .025
n2
9
s
471 .6
 2650  308 and an approximate
confidence interval is  Y0  Yˆ0  t e  2650 .17  2.262
12
n
prediction interval is Y  Yˆ  t s  2650 .17  2.262 471 .6  2650  1067 .
0
0
e
e) We can copy the Analysis of Variance in the question:

Source           DF   SS         MS        F       P
Regression        1   9772621    9772621   26.65   0.000
Residual Error   10   3667664     366766
Total            11   13440285

But we just got:

Source           DF   SS         MS         F       F.05
Regression        2   11439061   5719530    25.72   4.26
Residual Error    9    2001480   222386.7
Total            11   13440541

Let's use our SST and itemize this as:

Source           DF   SS         MS         F       F.05
X1                1   9772621    9772621    43.94   F.05(1,9) = 5.12
X2                1   1666440    1666440     7.49   F.05(1,9) = 5.12
Residual Error    9   2001480    222386.7
Total            11   13440541

The appropriate F to look at is opposite X2. Since 7.49 is above the table value, reject the null hypothesis of no relationship. Yes, the added variable helped.
I also ran this on the computer.

Regression Analysis: sales versus price, promotion
The regression equation is
sales = 6314 - 52.3 price + 2.35 promotion

Predictor   Coef      SE Coef   T       P
Constant    6313.6    812.9     7.77    0.000
price       -52.324   8.404     -6.23   0.000
promotio    2.3507    0.8584    2.74    0.023

S = 471.5   R-Sq = 85.1%   R-Sq(adj) = 81.8%

Analysis of Variance
Source           DF   SS         MS        F       P
Regression        2   11439515   5719757   25.73   0.000
Residual Error    9    2000770    222308
Total            11   13440285

Source      DF   Seq SS
price        1   9772621
promotion    1   1666894
7. The Lees present the following data on college students' summer wages vs. years of work experience, blocked by location.

           Years of Work Experience
Region      1     2     3
1          16    19    24
2          21    20    21
3          18    21    22
4          14    21    26

a. Do a 2-way ANOVA on these data and explain what hypotheses you test and what the conclusions are. (9) (Or do a 1-way ANOVA for 6 points.) The following column sums are done for you: $\sum x_1 = 69$, $\sum x_2 = 81$, $n_1 = 4$, $n_2 = 4$, $\sum x_1^2 = 1217$ and $\sum x_2^2 = 1643$. So $\bar x_1 = 17.25$ and $\bar x_2 = 20.25$.
b. Do a test of the equality of the means in columns 1 and 3 assuming that the columns are random samples from Normal populations with equal variances. (4)
c. Assume that columns 1 and 3 do not come from a Normal distribution and are not paired data and do a test for equal medians. (4)
d. Test the following data for uniformity. $n = 20$. (6)

Category   1   2   3    4   5
Numbers    0   2   0   10   8
Solution: a) This problem was on the last hour exam.
2-way ANOVA (blocked by region). 's' indicates that the null hypothesis is rejected, 'ns' that it is not.

Region   Exper 1   Exper 2   Exper 3   Sum x_i.   n_i   Mean x̄_i.   Σx²(row)   x̄_i.²
1         16.0      19.0      24.0       59.0      3     19.6667      1193      386.78
2         21.0      20.0      21.0       62.0      3     20.6667      1282      427.11
3         18.0      21.0      22.0       61.0      3     20.3333      1249      413.44
4         14.0      21.0      26.0       61.0      3     20.3333      1313      413.44
Sum       69.0     +81.0     +93.0     =243.0     12     20.25       =5037    =1640.78

n_j          4        +4        +4       =12
x̄_.j      17.25     20.25     23.25
Σx²(col)   1217     +1643     +2177    =5037
x̄_.j²    297.56   +410.06   +540.56  =1248.19

From the above, $\sum\sum x_{ij} = 243$, $n = 12$, $\bar x = \frac{243}{12} = 20.25$, $\sum\sum x_{ij}^2 = 5037$, $\sum \bar x_{i.}^2 = 1640.78$ and $\sum \bar x_{.j}^2 = 1248.19$.

$SST = \sum\sum x_{ij}^2 - n\bar x^2 = 5037 - 12(20.25)^2 = 5037 - 4920.75 = 116.25$
$SSC = \sum n_j\bar x_{.j}^2 - n\bar x^2 = 4(1248.19) - 12(20.25)^2 = 4992.76 - 4920.75 = 72$. (This is SSB in a one-way ANOVA.)
$SSR = \sum n_i\bar x_{i.}^2 - n\bar x^2 = 3(1640.78) - 12(20.25)^2 = 4922.34 - 4920.75 = 1.59$
($SSW = SST - SSC - SSR = 42.67$)

Source                 SS       DF   MS      F       F.05                H0
Rows (Regions)          1.59     3   0.53    0.075   F.05(3,6)=4.76 ns   Row means equal
Columns (Experience)   72.00     2   36.00   5.062   F.05(2,6)=5.14 ns   Column means equal
Within (Error)         42.67     6   7.112
Total                 116.25    11

Since F = 5.062 is just below $F^{(2,6)}_{.05} = 5.14$ (the computer output below gives p = .052), the difference between the column (years of experience) means falls just short of significance at the 5% level, though it would be significant at 10%.
Note that if you did this as a 1-way ANOVA, the SS and DF in the Rows line in the table would be added to the Within line.
Computer version:

Two-way ANOVA: C40 versus C41, C42

Analysis of Variance for C40
Source   DF    SS       MS      F      P
C41       3     1.58    0.53    0.07   0.972
C42       2    72.00   36.00    5.06   0.052
Error     6    42.67    7.11
Total    11   116.25
b) The data that we can use is repeated here:
$\bar x_1 = 17.25$, $s_1^2 = \frac{1217 - 4(17.25)^2}{3} = 8.91667$, $n_1 = 4$; $\bar x_3 = 23.25$, $s_3^2 = \frac{2177 - 4(23.25)^2}{3} = 4.91667$, $n_3 = 4$.
$H_0: \mu_1 = \mu_3$, $H_1: \mu_1 \ne \mu_3$, or equivalently $H_0: \mu_d = 0$, $H_1: \mu_d \ne 0$ where $D = \mu_1 - \mu_3$.
Test ratio method: $\bar d = \bar x_1 - \bar x_3 = 17.25 - 23.25 = -6.00$, $DF = n_1 + n_3 - 2 = 4 + 4 - 2 = 6$, $\alpha = .05$.
$\hat s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_3 - 1)s_3^2}{n_1 + n_3 - 2} = \frac{3(8.91667) + 3(4.91667)}{6} = \frac{26.75 + 14.75}{6} = 6.91667$
$s_{\bar d} = \hat s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_3}} = \sqrt{6.91667(0.5)} = \sqrt{3.458333} = 1.85966$
$t^{(6)}_{.025} = 2.447$
$t = \frac{\bar d - 0}{s_{\bar d}} = \frac{-6.00}{1.85966} = -3.226$, so reject the null hypothesis.
Of course, you could also do this with a critical value, $d_{cv} = D_0 \pm t\,s_{\bar d} = 0 \pm 2.447(1.85966)$, or a confidence interval, both of which should give you the same answer.
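An added check of part b) with SciPy's pooled-variance t test:

```python
from scipy import stats

# Pooled-variance t test for columns 1 and 3 (part b) above)
col1 = [16, 21, 18, 14]
col3 = [24, 21, 22, 26]
t, p = stats.ttest_ind(col1, col3, equal_var=True)
print(t, p)   # about -3.23 with p below .05, so reject equal means
```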
c) The null hypothesis is $H_0$: Columns come from the same distribution, or medians are equal. The data are repeated in order. The second number in each column is the rank of the number among the 8 numbers in the two groups.

14   1       21   4.5
16   2       22   6
18   3       24   7
21   4.5     26   8
    10.5         25.5

Since this refers to medians instead of means, and if we assume that the underlying distribution is not Normal, we use the nonparametric (rank test) analogue to comparison of two sample means of independent samples, the Wilcoxon-Mann-Whitney test. Note that the data is not cross-classified, so the Wilcoxon signed rank test is not applicable.
$H_0: \eta_1 = \eta_3$, $H_1: \eta_1 \ne \eta_3$.
We get $T_L = 10.5$ and $T_U = 25.5$. Check: since the total amount of data is $4 + 4 = 8 = n$, 10.5 + 25.5 must equal $\frac{n(n+1)}{2} = \frac{8(9)}{2} = 36$. They do.
For a 5% two-tailed test with $n_1 = 4$ and $n_2 = 4$, Table 6 says that the critical values are 11 and 25. We accept the null hypothesis in a 2-sided test if the smaller of the two rank sums lies between the critical values. The lower of the two rank sums, $W = 10.5$, is not between these values, so reject $H_0$.
d) This is basically Problem E11. This is the problem I used to use to introduce Kolmogorov-Smirnov. I stopped because everyone seemed to assume that all K-S tests were tests of uniformity.
Five different formulas are used for a new cola. Twenty tasters are asked to sample all five formulas and to indicate which one they preferred. The sponsors of the test assume that equal numbers will prefer each formula - this is the assumption of uniformity. Instead, none prefer Formulas 1 and 3; 2 people prefer Formula 2; 10 people prefer Formula 4 and 8 people prefer Formula 5. Test the responses for uniformity.
Solution: $H_0$: Uniformity, or $H_0: p_1 = p_2 = p_3 = p_4 = p_5$, where $p_i$ is the proportion that favor cola Formula $i$.
Since uniformity means that we expect equal numbers in each group, we can fill the $E$ column with fours, or just fill the next column with $p_i = \frac{1}{5}$.

Cola    O     O/n    F_o     E    E/n    F_e    D
1        0    0       0      4    .2     .2     .2
2        2    .1      .1     4    .2     .4     .3
3        0    0       .1     4    .2     .6     .5
4       10    .5      .6     4    .2     .8     .2
5        8    .4     1.0     4    .2    1.0     0
Total   20   1.0            20   1.0

The maximum deviation is 0.5. From the Kolmogorov-Smirnov table for $n = 20$, the critical values are:

α     .20    .10    .05    .01
CV   .232   .265   .294   .352

If we fit 0.5 into this pattern, it must have a p-value of less than .01, or we may simply note that if $\alpha = .05$, 0.5 is larger than .294, so we reject $H_0$.
Can you see why it would be impossible to do this problem by the chi-squared method?
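An added sketch of the K-S uniformity computation in the table above:

```python
import numpy as np

# Kolmogorov-Smirnov check of uniformity for the cola counts (part d) above)
observed = np.array([0, 2, 0, 10, 8])
n = observed.sum()
fo = np.cumsum(observed) / n        # observed cumulative proportions
fe = np.cumsum(np.full(5, 0.2))     # expected cumulative proportions
d_max = np.max(np.abs(fe - fo))
print(d_max, d_max > 0.294)         # 0.5, True -> reject uniformity at 5%
```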