252y0242 5/07/02
ECO252 QBA2
FINAL EXAM
May 6, 2002
Name KEY
Hour of Class Registered (Circle)
Note: If this is the only thing you look at before taking the final, you are badly cheating
yourself. People who used last year's final and did not read the problems carefully got very wrong
answers to them. If you can't be bothered to think, there is not much point to taking this course or
this exam.
Note: Have you reread “Things that You Should Never Do On a Statistics Exam …?” I think I could
have graded this exam by just looking for violations of these rules.
The following comments are cut and pasted from previous exams. People made the same mistakes on
this exam.
If you still think that a large p-value means that a coefficient is significant, you need a conference
with an audiologist. Further note that a p-value is a probability and can only be compared with
another probability (like the significance level).
The rule on p-value: If the p-value is less than the significance level, reject the null hypothesis; if the p-value is greater than or equal to the significance level, do not reject the null hypothesis.
Don’t tell me that a negative coefficient in a regression doesn't look right. There is nothing wrong
with a negative regression coefficient, unless you have a good reason to believe that it shouldn't be
negative.
How many times do I have to tell you:
1) Null hypotheses almost always contain parameters like μ, σ, p, ρ, Δμ = μ1 - μ2 or Δp = p1 - p2. They never contain sample statistics like x̄, s, p̂, d̄ = x̄1 - x̄2 or Δp̂.
2) Null hypotheses almost always contain equalities. (This means that if you want to test μ > 5, it's an alternate hypothesis and the null hypothesis is μ ≤ 5.)
When you find that you have a 1-sided alternate hypothesis, you do not use 2-sided confidence
intervals or tests.
Many people were too lazy or ignorant to compute Σx1y or Σx1x2. There is no way in the universe to get Σx1y from Σx1 and Σy, to get Σx1x2 from Σx1 and Σx2, or to get Σx1² from Σx1. You will always be asked to compute a sum of this sort on an exam, so figure out how to do it in advance.
A test of multiple proportions is a χ² test! Every year I see people trying to compare more than two proportions by a method appropriate for b) below. It doesn't work! Δp is defined as a difference between two proportions; when you have more than two, that definition doesn't work. Also, simply computing the proportions and telling me that they are different is just a way of making me suspect that you don't know what a statistical test is.
A test of multiple means is an Analysis of Variance! Every year I see people trying to compare more than two means by a method appropriate for comparing two means. It doesn't work! Δμ is defined as a difference between two means; when you have more than two, that definition doesn't work. Also, simply computing the means and telling me that they are different is just a way of making me suspect that you don't know what a statistical test is.
Most people groan when I say that the final exam is cumulative. On this exam someone claimed it wasn't cumulative enough, perhaps because the student had hardly been to class after the 3rd exam. The questions and the sections they covered follow.
I.
1. Sections I2, J3
2. a, b. Section K2; c. J4
3, 4. K2
II.
1. a. K4; b. K3; c. K3
2. a. D7; b. D2; c. F1; d. E3
3. a. D5; b. E4; c. B6 or D5
4. a. E1; b. D6; c. B5
5. G, H and I
6. J
7. a. J; b. J4; c. K2; d. J2; e. K2; f. F2
8. a. L2; b. L2; c. L5; d. L6; e. K5
I. (16+ points) Do all the following.
1. Hand in your fourth regression problem (2 points)
Remember: Y = 'Vol' = volatility (Std. Deviation of return) , X1 = 'CR' = Credit rating on a zero to 100
(per cent) scale, X2 = 'emd' = a dummy variable that is 1 if a country has an emerging market , 0 if the
country has a developed market, X3 = 'ecr' = the product of 'CR' and 'emd', X4 = 'gdp' = per capita
income in thousands of US dollars in the late '90's, X5 = 'gd-cr' = the product of 'CR' and 'gdp.' We
would expect foreign exchange rates to become less volatile as i) credit rating improves, ii) markets
become developed, and iii) per capita income rises. Remember saying 'yes' or 'no' to a question is
useless unless you cite statistical tests.
Use a significance level of 1% in this problem except when you are told otherwise.
2. Answer the following questions:
a. For the regression of 'Vol' against 'CR', 'emd', 'ecr', 'gdp' and 'gd-cr', what coefficients are significant at the 5% level? Why? What about the 1% level? (3)
b. Given the comments at the beginning of this page, what signs would you expect the coefficients to have? Do they have the expected signs? (4)
c. For the same regression, what does the ANOVA tell us? Why? (2)
d. In view of the analysis above, is there a regression that seems to work better than the one mentioned
in a) above? Why? (2)
3. The problem in the text says "Write a model that describes the relationship between volatility (Y) and credit rating as two nonparallel lines, one for each type of market ……. Is there evidence to conclude that the slope of the linear relationship between volatility (Y) and credit rating (X1) depends on market type?"
a. What equation did you fit that answers the questions in the text? Given the coefficients that you
found, what are the two equations (and coefficients) that your equation implies for these two market
types? (3)
b. Using the 1% significance level, what evidence can you present as to whether the slope depends on market type? (2)
4. What equation was suggested by your stepwise regression? Does this seem to work as well as the one suggested by the textbook authors? Why? If you compare the slope of the regression line relating volatility to the credit rating for countries with gdps of 2 (thousand) and 20 (thousand), what seems to be happening to the slope as per capita gdp rises? (5)
Solution:
2a. The printout from the computer for this equation says:
Regression Analysis

* NOTE * gd-cr is highly correlated with other predictor variables

The regression equation is
Vol = 40.1 - 0.227 CR + 16.9 emd - 0.332 ecr - 0.066 gdp + 0.00100 gd-cr

Predictor    Coef       Stdev      t-ratio    p
Constant     40.096     9.654      4.15       0.000
CR           -0.2268    0.1494     -1.52      0.142
emd          16.904     9.442      1.79       0.086
ecr          -0.3316    0.1494     -2.22      0.036
gdp          -0.0659    0.4245     -0.16      0.878
gd-cr        0.001000   0.005436   0.18       0.856

s = 2.768    R-sq = 96.2%    R-sq(adj) = 95.4%
The following coefficients have p-values below .05: The constant (.000), and ecr (.036). These are
significant at the 5% level. Since the p-value for the constant is the only one below .01, only the constant
is significant at the 1% level.
2b. From the comments at the top of the page:

CR: We would expect foreign exchange rates to become less volatile as credit rating improves, so this would be negative. (It was!)
emd: This variable equals 1 if a country does not have a developed market. We would expect foreign exchange rates to become less volatile as markets become developed, so its sign should be positive. (It was.)
ecr: This is a product of the two variables above. It is zero for developed markets. In emerging markets, we might expect more sensitivity to the credit rating, since they are often thinner markets. So this is most likely negative. (It was.)
gdp: We would expect foreign exchange rates to become less volatile as per capita income rises. This would have a negative sign. (It was.)
gd-cr: Since this variable will have its highest value for countries with high credit ratings and high incomes, this should have a negative sign too. (Note that it does not!)
2c. The ANOVA is below:
Analysis of Variance

SOURCE        DF    SS         MS        F         p
Regression    5     4596.79    919.36    120.01    0.000
Error         24    183.86     7.66
Total         29    4780.65

SOURCE    DF    SEQ SS
CR        1     4388.03
emd       1     55.70
ecr       1     152.78
gdp       1     0.02
gd-cr     1     0.26
The most important conclusion here is that since the p-value is tiny, we can reject the null hypothesis that
the independent variables have no ability to explain variation of Y.
2d. We have a high R-squared here (96.2% - 95.4% adjusted), but the coefficients are not very significant
and one seems to have the wrong sign. The regression recommended by the authors, 'Vol' against 'CR', 'emd'
and 'ecr' seems to have as good an R-squared (96.1% - 95.7% adjusted) , significant coefficients at both the
1% and 5% level and the expected signs on the coefficients.
3a. The equation mentioned in 2d does the job. It is:
Vol = 38.6 - 0.204 CR + 18.3 emd - 0.354 ecr
For a developed market, 'emd' and 'ecr' are zero, so the equation becomes:
Vol = 38.6 - 0.204 CR
But for an emerging market 'emd' is 1 and 'ecr' is equal to 'CR', so the equation becomes:
Vol = 38.6 - 0.204 CR + 18.3(1) - 0.354 CR
Or:
Vol = 56.9 - 0.558 CR.
3b. All the p-values are below 1%, especially that of the coefficient of 'ecr'. So the slope depends on market
type.
4. The stepwise regression suggested the equation:
Vol = 56.0 - 0.523 CR + 0.00442 gd-cr
All the coefficients have a zero p-value and are thus highly significant, however 'gd-cr' has the wrong sign.
Remember that 'gd-cr' is the product of 'gdp' and 'cr'. If a country has a 'gdp' of 2, the equation becomes:
Vol = 56.0 - 0.523 CR + 0.00442(2)CR or: Vol = 56.0 - 0.514 CR
But if the gdp is 20, the equation becomes:
Vol = 56.0 - 0.523 CR + 0.00442(20)CR or: Vol = 56.0 - 0.435 CR
This seems to indicate that the volatility is less responsive to credit rating in rich countries. This doesn't
sound reasonable to me.
II. Do at least 4 of the following 7 Problems (at least 15 points each) (or do sections adding to at least 60 points. Anything extra you do helps, and grades wrap around). Show your work! State H0 and H1 where applicable. Use a significance level of 5% unless noted otherwise. Do not answer questions without citing appropriate statistical tests.
1. A researcher is investigating the behavior of the Dow-Jones Transportation, Industrial and Utility
averages. Data is presented below for closing numbers for 14 days in May 2001. Because the researcher
believes that the underlying distributions are not Normal, she computes rank correlations instead of standard
correlations.
For your convenience, ranks have been computed for Transportation and Industry.
a. Check the utilities for rises and falls in value, marking rises with + and falls with -. Using a statistical test,
find out if the pattern of rises and falls is random. (5)
b. Compute a rank correlation between industry and utility prices and test it for significance. (5)
c. Compute a measurement of concordance between the three series and test it for significance. Express it
on a zero to one scale. (6)
Row   Date    Trans (x1)   r1    Indust (x2)   r2    Util (x3)
1     5/07    2850.64      1     10935.2       7     383.93
2     5/08    2865.54      2     10888.5       5     378.74
3     5/09    2865.73      3     10867.0       2     383.74
4     5/10    2899.76      7     10910.4       6     383.52
5     5/11    2879.56      5     10821.3       1     386.64
6     5/14    2874.59      4     10877.3       4     391.04
7     5/15    2880.24      6     10873.0       3     385.70
8     5/16    2925.50      8     11215.9       10    387.52
9     5/17    2957.58      10    11248.6       11    387.84
10    5/18    2978.95      12    11301.7       13    391.54
11    5/21    3004.35      14    11337.9       14    394.43
12    5/22    2990.97      13    11257.2       12    394.67
13    5/23    2969.16      11    11105.5       8     398.31
14    5/24    2951.01      9     11122.4       9     397.68
Solution: Data is repeated with ranks and signs added for x3 . It's remarkable how many people thought that
they could do this without ranking x3 .
Row   Date    x1        r1    x2        r2    x3       sign   r3
1     5/07    2850.64   1     10935.2   7     383.93          4
2     5/08    2865.54   2     10888.5   5     378.74   -      1
3     5/09    2865.73   3     10867.0   2     383.74   +      3
4     5/10    2899.76   7     10910.4   6     383.52   -      2
5     5/11    2879.56   5     10821.3   1     386.64   +      6
6     5/14    2874.59   4     10877.3   4     391.04   +      9
7     5/15    2880.24   6     10873.0   3     385.70   -      5
8     5/16    2925.50   8     11215.9   10    387.52   +      7
9     5/17    2957.58   10    11248.6   11    387.84   +      8
10    5/18    2978.95   12    11301.7   13    391.54   +      10
11    5/21    3004.35   14    11337.9   14    394.43   +      11
12    5/22    2990.97   13    11257.2   12    394.67   +      12
13    5/23    2969.16   11    11105.5   8     398.31   +      14
14    5/24    2951.01   9     11122.4   9     397.68   -      13
a) This is a runs test. The null hypothesis is randomness. r = 7, n1 = 4, n2 = 9 and n = n1 + n2 = 13. Since the numbers given on the 5% runs test table are 3 and a blank, and r = 7 is above 3, we cannot reject the null hypothesis.
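As a cross-check (not part of the original key), here is a minimal Python sketch that counts the runs in the sign sequence transcribed from the table above.

```python
# Count runs of + and - in the sequence of rises and falls of the utilities.
signs = ['-', '+', '-', '+', '+', '-', '+', '+', '+', '+', '+', '+', '-']

n1 = signs.count('-')                                   # 4 falls
n2 = signs.count('+')                                   # 9 rises
r = 1 + sum(a != b for a, b in zip(signs, signs[1:]))   # 7 runs

print(n1, n2, r)  # compare r = 7 with the 5% runs-test table for (4, 9)
```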
b)

Row    r2     r3     d = r2 - r3    d²
1      7      4      3              9
2      5      1      4              16
3      2      3      -1             1
4      6      2      4              16
5      1      6      -5             25
6      4      9      -5             25
7      3      5      -2             4
8      10     7      3              9
9      11     8      3              9
10     13     10     3              9
11     14     11     3              9
12     12     12     0              0
13     8      14     -6             36
14     9      13     -4             16
Sum    105    105    0              184
From the outline: rs = 1 - 6Σd²/[n(n² - 1)] = 1 - 6(184)/[14(195)] = 1 - 1104/2730 = 1 - 0.4044 = 0.596.
The n = 14 line from the rank correlation table has:

n      α = .050    α = .025    α = .010    α = .005
14     .4593       .5341       .6220       .6978

If you tested H0: ρs ≤ 0 against H1: ρs > 0 at the 5% level, reject the null hypothesis of no relationship if rs is above .4593; or, if you tested H0: ρs = 0 against H1: ρs ≠ 0 at the 5% level, reject the null hypothesis if rs is above .5341. So, in this case we have a significant rank correlation.
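For anyone who wants to verify the hand computation, here is a short sketch using SciPy's spearmanr (assuming SciPy is available); the data are transcribed from the table above.

```python
from scipy import stats

# Industrials (x2) and utilities (x3), transcribed from the table above.
x2 = [10935.2, 10888.5, 10867.0, 10910.4, 10821.3, 10877.3, 10873.0,
      11215.9, 11248.6, 11301.7, 11337.9, 11257.2, 11105.5, 11122.4]
x3 = [383.93, 378.74, 383.74, 383.52, 386.64, 391.04, 385.70,
      387.52, 387.84, 391.54, 394.43, 394.67, 398.31, 397.68]

rs, p = stats.spearmanr(x2, x3)   # ranks both series, then correlates the ranks
print(rs)   # about 0.596, matching 1 - 6(184)/[14(195)]
```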
c) To compute S write:

Row    r1     r2     r3     SR     SR²
1      1      7      4      12     144
2      2      5      1      8      64
3      3      2      3      8      64
4      7      6      2      15     225
5      5      1      6      12     144
6      4      4      9      17     289
7      6      3      5      14     196
8      8      10     7      25     625
9      10     11     8      29     841
10     12     13     10     35     1225
11     14     14     11     39     1521
12     13     12     12     37     1369
13     11     8      14     33     1089
14     9      9      13     31     961
Sum    105    105    105    315    8757
From the outline: Take k columns with n items in each and rank each column from 1 to n. The null hypothesis is that the rankings disagree. Compute a sum of ranks SRi for each row. Then

S = ΣSR² - n(S̄R)² = 8757 - 14(22.5)² = 1669.5,

where S̄R = ΣSRi/n = 315/14 = 22.5 is the mean of the SRi s. If H0 is disagreement, S can be checked against a table for this test. If S > Sα, reject H0. Since n is too large for the table, use

χ² = k(n - 1)W = 3(13)(.8154) = 31.8, which has n - 1 = 13 degrees of freedom, where

W = S/[(1/12)k²(n³ - n)] = 1669.5/[(1/12)(3²)(14³ - 14)] = 20034/24570 = .8154

is the Kendall Coefficient of Concordance and must be between 0 and 1. Since 31.8 is above χ².05(13) = 22.3621, the concordance is significant and we reject H0.
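Here is a minimal numpy sketch (my own check, not the original key's method) of the same Kendall W computation.

```python
import numpy as np

# Ranks of the three series (r1, r2, r3 from the table above); rows are days.
ranks = np.array([
    [1, 7, 4], [2, 5, 1], [3, 2, 3], [7, 6, 2], [5, 1, 6], [4, 4, 9],
    [6, 3, 5], [8, 10, 7], [10, 11, 8], [12, 13, 10], [14, 14, 11],
    [13, 12, 12], [11, 8, 14], [9, 9, 13]])
n, k = ranks.shape                        # n = 14 rows ranked by k = 3 series

SR = ranks.sum(axis=1)                    # row rank sums
S = ((SR - SR.mean()) ** 2).sum()         # 1669.5
W = 12 * S / (k**2 * (n**3 - n))          # Kendall's W, about 0.8154
chi2 = k * (n - 1) * W                    # about 31.8, checked against chi-squared, n-1 df
print(S, W, chi2)
```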
2. (Pelosi and Sandifer) A diaper company is testing three filler materials for diapers. Eight diapers were tested with each of the three filler materials, making a total of 24 diapers put on 24 toddlers. Each column (x1, x2 and x3) can be considered a random sample of eight taken from a Normally distributed population. As each toddler played, fluid was injected into the diaper until the product leaked. Each number below in x1, x2 and x3 represents the capacity of the diaper. The remaining columns (r1, r2 and r3) are a ranking of the 24 numbers. In this entire problem we assume that the underlying distributions are Normal.
Row   x1    r1      x2    r2      x3    r3
1     792   5.0     809   12.0    826   18.0
2     790   2.0     818   17.0    813   15.5
3     797   6.0     803   8.5     854   23.0
4     803   8.5     781   1.0     843   20.0
5     811   13.5    813   15.5    846   21.0
6     791   3.5     808   11.0    847   22.0
7     801   7.0     805   10.0    835   19.0
8     791   3.5     811   13.5    872   24.0
The following are computed for you: Σx1 = 6376.00, Σx2 = 6448.00, Σx3 = 6736.00, Σx1² = 5082066, Σx2² = 5197954, Σx3² = 5673944 and n1 = n2 = n3 = 8.
a. Compute the sample variances of x1 and x3 and test the hypothesis that the population variances for these two columns are equal. (4)
b. Assume that the variances of the populations from which x2 and x3 come are equal and test the hypothesis that μ3 is greater than μ2. (i) First state your null and alternate hypotheses (2) and then test the hypotheses using (ii) a test ratio, (iii) a critical value and (iv) a confidence interval. (6)
c. Test if the hypothesis that the means of all three populations are equal holds water. (7)
d. Use a test of goodness of fit to see if x2 has the Normal distribution. (5)
Solution:
a) As I'm sure we all know:

x̄1 = Σx1/n1 = 6376/8 = 797,  s1² = (Σx1² - n1x̄1²)/(n1 - 1) = (5082066 - 8(797)²)/7 = 56.286,  s1 = 7.5024
x̄2 = Σx2/n2 = 6448/8 = 806,  s2² = (Σx2² - n2x̄2²)/(n2 - 1) = (5197954 - 8(806)²)/7 = 123.714,  s2 = 11.1227
x̄3 = Σx3/n3 = 6736/8 = 842,  s3² = (Σx3² - n3x̄3²)/(n3 - 1) = (5673944 - 8(842)²)/7 = 318.857,  s3 = 17.8566
If we follow 252meanx4 in the outline: Our hypotheses are H0: σ1² = σ3² and H1: σ1² ≠ σ3². DF1 = n1 - 1 = 7 and DF3 = n3 - 1 = 7. Since the table is set up for one-sided tests, if we wish to test H0: σ1² = σ3², we must do two separate one-sided tests. First test s1²/s3² = 56.286/318.857 = 0.177 against F.025(DF1, DF3) = F.025(7,7) = 4.99, and then test s3²/s1² = 318.857/56.286 = 5.665 against F.025(DF3, DF1) = F.025(7,7) = 4.99. If either test is failed, we reject the null hypothesis. Since 5.665 is above the table F, we reject the null hypothesis. We really should not be using a method for comparing the means that requires equal variances.
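A quick SciPy check of the two one-sided F tests; scipy.stats.f.ppf supplies the table value.

```python
from scipy import stats

s1_sq, s3_sq, n1, n3 = 56.286, 318.857, 8, 8

F = s3_sq / s1_sq                               # larger variance on top: 5.665
crit = stats.f.ppf(1 - 0.025, n3 - 1, n1 - 1)   # F.025(7,7), about 4.99
print(F, crit, F > crit)                        # True -> reject equal variances
```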
b) From Table 3 of the Syllabus Supplement, for the difference between two means (σ unknown, variances assumed equal), with Δμ = μ1 - μ2:

Confidence interval: Δμ = d̄ ± tα/2 sd
Hypotheses: H0: Δμ = Δμ0, H1: Δμ ≠ Δμ0
Test ratio: t = (d̄ - Δμ0)/sd
Critical value: d̄cv = Δμ0 ± tα/2 sd

where sd = ŝp √(1/n1 + 1/n2), ŝp² = [(n1 - 1)s1² + (n2 - 1)s2²]/(n1 + n2 - 2) and DF = n1 + n2 - 2.
x̄2 = 806, s2² = 123.714, s2 = 11.1227, x̄3 = 842, s3² = 318.857, s3 = 17.8566 and n2 = n3 = 8.

d̄ = x̄2 - x̄3 = 806 - 842 = -36
ŝp² = [(n2 - 1)s2² + (n3 - 1)s3²]/(n2 + n3 - 2) = [7(123.714) + 7(318.857)]/14 = (123.714 + 318.857)/2 = 221.2855
sd = ŝp √(1/n2 + 1/n3) = √(221.2855(1/8 + 1/8)) = √(221.2855(.25)) = √55.321375 = 7.4378
DF = n2 + n3 - 2 = 8 + 8 - 2 = 14, α = .05, t.05(14) = 1.761
(i) Hypotheses: H0: μ3 ≤ μ2, or H0: μ2 ≥ μ3, or, with Δμ = μ2 - μ3, H0: Δμ ≥ 0; against H1: μ3 > μ2, i.e. H1: Δμ < 0.
(ii) Test Ratio: t = (d̄ - Δμ0)/sd = (-36 - 0)/7.4378 = -4.840. Make a diagram of an almost-Normal curve with a mean at zero and a 'reject' region below -1.761. Since -4.840 is in the reject region, reject the null hypothesis.
(iii) Critical Value: d̄cv = Δμ0 ± tα/2 sd becomes d̄cv = Δμ0 - tα sd = 0 - 1.761(7.4378) = -13.098. Make a diagram of an almost-Normal curve with a mean at zero and a 'reject' region below -13.098. Since -36 is in the reject region, reject the null hypothesis.
(iv) Confidence Interval: Δμ = d̄ ± tα/2 sd becomes Δμ ≤ d̄ + tα sd = -36 + 1.761(7.4378) = -22.902. Since Δμ ≤ -22.902 contradicts H0: Δμ ≥ 0, reject the null hypothesis.
c) H0: μ1 = μ2 = μ3, H1: not all means equal, α = .05.

        1         2         3         Sum
        792       809       826
        790       818       813
        797       803       854
        803       781       843
        811       813       846
        791       808       847
        801       805       835
        791       811       872
Sum     6376      6448      6736      ΣΣxij = 19560
nj      8         8         8         n = 24
x̄j      797       806       842       x̄ = 19560/24 = 815
SS      5082066   5197954   5673944   ΣΣxij² = 15953964
x̄j²     635209    649636    708964    Σx̄j² = 1993809
SST = ΣΣxij² - nx̄² = 15953964 - 24(815)² = 12564
SSB = Σnj x̄j² - nx̄² = 8(797² + 806² + 842²) - 24(815)² = 8(1993809) - 15941400 = 9072
Note that none of the items in the SS column can be negative.
Source     SS       DF    MS      F        F.05
Between    9072     2     4536    27.28    F.05(2,21) = 3.47 s
Within     3492     21    166
Total      12564    23

H0: Column means equal.
Since the value of F we calculated is more than the table value, we reject the null hypothesis and conclude
that there is a significant difference between column means.
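A one-line check with SciPy's f_oneway:

```python
from scipy import stats

x1 = [792, 790, 797, 803, 811, 791, 801, 791]
x2 = [809, 818, 803, 781, 813, 808, 805, 811]
x3 = [826, 813, 854, 843, 846, 847, 835, 872]

F, p = stats.f_oneway(x1, x2, x3)
print(F, p)   # F about 27.3 with (2, 21) df; p is tiny, so reject equal means
```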
d) H0: x2 ~ N(μ?, σ?); H1: Not Normal.
Because the mean and standard deviation are unknown, this is a Lilliefors problem. Note that data must be in order for the Lilliefors or K-S method to work. From the data we found that x̄ = 806 and s = 11.1227. t = (x - x̄)/s, and Fe(t) is computed from the Normal table. For example,
Fe(781) = P(x ≤ 781) = P(z ≤ -2.25) = P(z ≤ 0) - P(-2.25 ≤ z ≤ 0) = .5 - .4878 = .0122.
D is the difference (absolute value) between the two cumulative distributions.
Row   x     t       O    O/n     FO       Fe       D
1     781   -2.25   1    0.125   0.125    0.0122   0.1128
2     803   -0.27   1    0.125   0.250    0.3936   0.1436
3     805   -0.09   1    0.125   0.375    0.4641   0.0891
4     808   0.18    1    0.125   0.500    0.5714   0.0714
5     809   0.27    1    0.125   0.625    0.6064   0.0186
6     811   0.45    1    0.125   0.750    0.6736   0.0764
7     813   0.63    1    0.125   0.875    0.7357   0.1393
8     818   1.08    1    0.125   1.000    0.8599   0.1401
The maximum deviation is 0.1436. The Lilliefors table for α = .05 and n = 8 gives a critical value of 0.285. Since our maximum deviation does not exceed the critical value, we do not reject H0.
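Here is a sketch that mirrors the table's computation; note it compares Fe only with the cumulative O/n just after each observation, exactly as the table above does.

```python
import numpy as np
from scipy import stats

x = np.sort(np.array([809, 818, 803, 781, 813, 808, 805, 811], float))
n = x.size
t = (x - x.mean()) / x.std(ddof=1)      # standardize with the sample mean and s

Fe = stats.norm.cdf(t)                  # fitted Normal cdf at each point
Fo = np.arange(1, n + 1) / n            # empirical cdf after each point
D = np.abs(Fo - Fe).max()
print(D)    # 0.1436; compare with the Lilliefors critical value 0.285 (n = 8, alpha = .05)
```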
3. Data from the previous problem is repeated. In this problem assume that the underlying distributions are
not Normal. Remember that each column is an independent sample.
Row   x1    r1      x2    r2      x3    r3
1     792   5.0     809   12.0    826   18.0
2     790   2.0     818   17.0    813   15.5
3     797   6.0     803   8.5     854   23.0
4     803   8.5     781   1.0     843   20.0
5     811   13.5    813   15.5    846   21.0
6     791   3.5     808   11.0    847   22.0
7     801   7.0     805   10.0    835   19.0
8     791   3.5     811   13.5    872   24.0
The following are computed for you: Σx1 = 6376.00, Σx2 = 6448.00, Σx3 = 6736.00, Σx1² = 5082066, Σx2² = 5197954, Σx3² = 5673944 and n1 = n2 = n3 = 8.
a. Test the hypothesis that the median of the population underlying x3 is larger than the median of the population underlying x2. (6)
b. Test the hypothesis that all three columns come from populations with equal medians. (7)
c. Test the hypothesis that x2 comes from a population with a median of 804 using either a sign test (4) or a Wilcoxon signed rank test (5).
Solution: a) The claim that the median of the population underlying x3 is larger than the median of the population underlying x2 contains no equality, so it is the alternate hypothesis; the null hypothesis is that the median underlying x3 is no larger than the median underlying x2.
The text below is largely repeated from 252review.
The data are repeated in order. The ranks that appear above appear in parentheses, since they make it easier
to find r2 and r3 , the ranks of the numbers among the 16 numbers in the two groups.
Row   x2    (r2)    r2     x3    (r3)    r3
1     809   12.0    5      826   18.0    10
2     818   17.0    9      813   15.5    7.5
3     803   8.5     2      854   23.0    15
4     781   1.0     1      843   20.0    12
5     813   15.5    7.5    846   21.0    13
6     808   11.0    4      847   22.0    14
7     805   10.0    3      835   19.0    11
8     811   13.5    6      872   24.0    16
Sum                 37.5                  98.5
Since this refers to medians instead of means and we assume that the underlying distribution is not Normal, we use the nonparametric (rank test) analogue to comparison of two sample means of independent samples, the Wilcoxon-Mann-Whitney Test. (Note that data is not cross-classified, so the Wilcoxon Signed Rank Test is not applicable.)

H0: η2 ≥ η3, H1: η2 < η3.

We get TL = 37.5 and TU = 98.5. Check: Since the total amount of data is 8 + 8 = 16 = n, 37.5 + 98.5 must equal n(n + 1)/2 = 16(17)/2 = 136. They do.

For a 5% one-tailed test with n1 = 8 and n2 = 8, Table 6 says that the critical values are 52 and 84. We accept the null hypothesis in a 1-sided test if the smaller of the two rank sums lies between the critical values. The lower of the two rank sums, W = 37.5, is not between these values, so reject H0.
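SciPy's mannwhitneyu runs the equivalent test on the raw data:

```python
from scipy import stats

x2 = [809, 818, 803, 781, 813, 808, 805, 811]
x3 = [826, 813, 854, 843, 846, 847, 835, 872]

# H1: x2 values tend to be smaller than x3 values (median of x3 larger).
U, p = stats.mannwhitneyu(x2, x3, alternative='less')
print(U, p)   # U = TL - n1(n1+1)/2 = 37.5 - 36 = 1.5; p < .05, so reject H0
```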
b) Since this involves comparing three apparently random samples from a non-Normal distribution, we use a Kruskal-Wallis test. The null hypothesis is H0: the columns come from the same distribution, or the medians are equal. If we repeat the table once again and add the rank sums we get:
Row   x1    r1      x2    r2      x3    r3
1     792   5.0     809   12.0    826   18.0
2     790   2.0     818   17.0    813   15.5
3     797   6.0     803   8.5     854   23.0
4     803   8.5     781   1.0     843   20.0
5     811   13.5    813   15.5    846   21.0
6     791   3.5     808   11.0    847   22.0
7     801   7.0     805   10.0    835   19.0
8     791   3.5     811   13.5    872   24.0
Sum         49.0          88.5          162.5
Sums of ranks are given above. To check the ranking, note that the sum of the three rank sums is 49.0 + 88.5 + 162.5 = 300, that the total number of items is 24, and that the sum of the first n numbers, n(n + 1)/2 = 24(25)/2 = 300, must equal it. Now compute the Kruskal-Wallis statistic

H = [12/(n(n + 1))] Σ(SRi²/ni) - 3(n + 1) = [12/(24(25))][(49.0² + 88.5² + 162.5²)/8] - 3(25) = (1/50)(36639.5/8) - 75 = 16.59875.

If we try to look up this result in the (8, 8, 8) section of the Kruskal-Wallis table (Table 9), we find that the problem is too large for the table. Thus we must use the chi-squared table with 2 degrees of freedom. Since 16.59875 is above χ².05(2) = 5.9915, reject H0.
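SciPy's kruskal reproduces H (with a small tie correction that this hand computation omits):

```python
from scipy import stats

x1 = [792, 790, 797, 803, 811, 791, 801, 791]
x2 = [809, 818, 803, 781, 813, 808, 805, 811]
x3 = [826, 813, 854, 843, 846, 847, 835, 872]

H, p = stats.kruskal(x1, x2, x3)
print(H, p)   # H near 16.6 (ties adjust it slightly); p < .05, so reject H0
```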
c) We repeat the column with the alleged median next to it. H0: η2 = 804 and d = x2 - η0. r is the rank of the absolute values of d and r* is the rank with signs and corrections for ties.

Row   x2    η0    d     |d|   r    r*
1     809   804   5     5     4    +4
2     818   804   14    14    7    +7
3     803   804   -1    1     1    -1.5
4     781   804   -23   23    8    -8
5     813   804   9     9     6    +6
6     808   804   4     4     3    +3
7     805   804   1     1     2    +1.5
8     811   804   7     7     5    +5

Our check this time is that the sum of the ranks is the sum of the numbers 1 through 8, which is 8(9)/2 = 36: indeed 4 + 7 + 1.5 + 8 + 6 + 3 + 1.5 + 5 = 36.
The sum of the + numbers is T+ = 26.5, while T- = 9.5. If we check these against Table 7 for n = 8, we find that the smaller of the two numbers must be 4 or below for a rejection in a 2-sided 5% test. Since 9.5 is above 4, we do not reject the null hypothesis.
If, instead, we do the simpler and less powerful sign test, we simply look at how many numbers (2) are below 804 or how many numbers (6) are above 804. Since this is a 2-sided test, the p-value is found by checking the binomial table with p = .5: 2P(x ≤ 2) = 2(.14453) = .28906, or 2P(x ≥ 6) = 2[1 - P(x ≤ 5)] = 2(1 - .85547) = .28906. Because this p-value is above our significance level, we do not reject the null hypothesis.
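Both tests can be checked in SciPy; wilcoxon reports min(T+, T-) and the binomial cdf gives the sign-test p-value.

```python
from scipy import stats

x2 = [809, 818, 803, 781, 813, 808, 805, 811]
eta0 = 804

# Wilcoxon signed rank test of H0: median = 804 (two-sided).
res = stats.wilcoxon([x - eta0 for x in x2])
print(res.statistic)                         # min(T+, T-) = 9.5

# Sign test: 2 of the 8 observations fall below 804.
below = sum(x < eta0 for x in x2)
p_sign = 2 * stats.binom.cdf(min(below, 8 - below), 8, 0.5)
print(p_sign)                                # about .289, above .05, do not reject
```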
4. (Pelosi and Sandifer) A survey on student drinking revealed the following:
Residence     Nonbinge Drinker   Infrequent Binge Drinker   Frequent Binge Drinker   Total
On Campus     35                 29                         47                       111
Off Campus    49                 31                         24                       104
Total         84                 60                         71                       215
a. Test the hypothesis that the proportion in each of the three drinking categories is the same regardless of
where a student lives. (7)
b. Test the hypothesis that the proportion of infrequent binge drinkers is higher off campus than on campus.
(4)
c. The researcher believes that, nationwide, the proportion of frequent binge drinkers is 30%. Test to see if
the proportion on the campus profiled above is higher. (3)
d. Find a p-value for the result in c (2)
Solution:
a) H0: Homogeneous (or p1 = p2 = p3); H1: Not homogeneous (not all ps are equal). DF = (r - 1)(c - 1) = (1)(2) = 2, so χ².05(2) = 5.9915.

O        1      2      3      Total     pr
On       35     29     47     111       .516279
Off      49     31     24     104       .483721
Total    84     60     71     215       1.000000

E        1          2          3          Total      pr
On       43.3674    30.9767    36.6558    111.000    .516279
Off      40.6326    29.0233    34.3442    104.000    .483721
Total    84.0000    60.0000    71.0000    215.000    1.000000

The proportions in rows, pr, are used with column totals to get the items in E. Note that row and column sums in E are the same as in O. (Note that χ² = 9.63298 = 224.6329 - 215 is computed two different ways here; only one way is needed.)
Row   O     E         O - E      (O-E)²    (O-E)²/E    O²/E
1     35    43.3674   -8.3674    70.013    1.61442     28.2470
2     29    30.9767   -1.9767    3.907     0.12614     27.1494
3     47    36.6558   10.3442    107.002   2.91911     60.2633
4     49    40.6326   8.3674     70.013    1.72308     59.0905
5     31    29.0233   1.9767     3.907     0.13463     33.1113
6     24    34.3442   -10.3442   107.002   3.11559     16.7714
      215   215.000   0.0000               9.63298     224.6329
Since the χ² computed here (9.63298) is greater than the χ².05(2) = 5.9915 from the table, we reject H0.
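SciPy's chi2_contingency reproduces both the expected table and the χ² statistic:

```python
import numpy as np
from scipy import stats

O = np.array([[35, 29, 47],     # on campus
              [49, 31, 24]])    # off campus

chi2, p, df, E = stats.chi2_contingency(O, correction=False)
print(chi2, df, p)   # chi2 about 9.63 with 2 df; p < .05, so reject homogeneity
print(E)             # expected counts, matching the E table above
```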
b) We are comparing p̂1 = 29/111 = .26126 (n1 = 111) and p̂2 = 31/104 = .29808 (n2 = 104). From the formula table, for the difference between proportions, with Δp = p1 - p2, Δp0 = p01 - p02 and q = 1 - p:

Confidence interval: Δp = Δp̂ ± zα/2 sΔp, where sΔp = √(p̂1q̂1/n1 + p̂2q̂2/n2)
Hypotheses: H0: Δp = Δp0, H1: Δp ≠ Δp0
Test ratio: z = (Δp̂ - Δp0)/σΔp
Critical value: Δp̂cv = Δp0 ± zα/2 σΔp

If Δp0 = 0, σΔp = √(p̂0q̂0(1/n1 + 1/n2)), where p̂0 = (n1p̂1 + n2p̂2)/(n1 + n2); if Δp0 ≠ 0, use sΔp.
12
252y0242 5/07/02
sp 
p1q1 p2q2
.26126 .73874  .29808 .70192 



 .0017388  .0020118  .00374704  .0612419
n1
n2
111
104
29  31
n p  n2 p2 111.26162   104 .29808 
 1 1

 .27907 ,
111  104
n1  n2
111  104
 1.645 . Note that q  1  p and that q and p are between 0 and 1.
p  p1  p2  .036816 , p0 
  .05, z  z.05
 p 
p0q0

1
n1

1
n3

.27907 .72093 1111  1104 
.0037470  .061213
H0 : p  0
H 0 : p1  p2
H 0 : p1  p2  0
Our hypotheses are 
or 
or 
H1 : p 0
H1 : p1  p2
H1 : p1  p2  0
There are three ways to do this problem. Only one is needed
p  p0 .036816  0

 0.6015 Make a Diagram showing a 'reject'
(i) Test Ratio: z 
 p
.061231
region below -1.645. Since -0.6015 is above this value, do not reject H 0 .
(ii) Critical Value: pcv  p0  z  p becomes pcv  p0  z p
2
 0  1.645 .061231   .10069 . Make a Diagram showing a 'reject' region below - 0.10069.
Since p  .036816is not below this value, do not reject H 0 .
(iii) Confidence Interval:: p  p  z s p becomes p  p  z sp
2
 .036816  1.645 .0612419   0.0639 . Since
not reject H 0 .
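If statsmodels is available, proportions_ztest runs the same pooled two-proportion z test:

```python
from statsmodels.stats.proportion import proportions_ztest

count = [29, 31]      # infrequent binge drinkers on and off campus
nobs = [111, 104]

# H1: the on-campus proportion is smaller than the off-campus one.
z, p = proportions_ztest(count, nobs, alternative='smaller')
print(z, p)   # z about -0.60; p well above .05, so do not reject H0
```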
c) From the formula table we have, for a proportion (q = 1 - p):

Confidence interval: p = p̂ ± zα/2 sp, where sp = √(p̂q̂/n)
Hypotheses: H0: p = p0, H1: p ≠ p0
Test ratio: z = (p̂ - p0)/σp, where σp = √(p0q0/n) and q0 = 1 - p0
Critical value: p̂cv = p0 ± zα/2 σp
H1: p > .30. It is an alternate hypothesis because it does not contain an equality. The null hypothesis is thus H0: p ≤ .30. The problem says that α = .05, n = 215 and x = 71, so that p̂ = x/n = 71/215 = .33023 and σp = √(.33023(.66977)/215) = .03207. This is a one-sided test with zα = z.05 = 1.645. This problem can be done in one of three ways.

(i) The test ratio is z = (p̂ - p0)/σp = (.33023 - .30)/.03207 = .9426. Make a diagram of a Normal curve with a mean at zero and a reject zone above zα = z.05 = 1.645. Since z = 0.9426 is not in the 'reject' zone, do not reject H0. We cannot say that the proportion of binge drinkers is significantly above 30%.

(ii) Since the alternative hypothesis says p > .30, we need a critical value that is above .30. We use p̂cv = p0 + zα σp = .30 + 1.645(.03207) = .3528. Make a diagram of a Normal curve with a mean at .30 and a reject zone above .3528. Since p̂ = .33023 is not in the 'reject' zone, do not reject H0. We cannot say that the proportion is significantly above 30%.
(iii) To do a confidence interval we need sp = √(p̂q̂/n). To make the 2-sided confidence interval, p = p̂ ± zα/2 sp, into a 1-sided interval, go in the same direction as H1: p > .30. We get p ≥ p̂ - zα sp. This is not a good use of our time here, but it should not contradict the null hypothesis.
d) In this case the p-value would be (from (i) above) P(z ≥ .9426) ≈ P(z ≥ 0.94) = .5 - .3264 = .1736. Of course, you could answer c) by observing that this is above α = .05.
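A short sketch of part c's computation; note the answer above plugs p̂ into the standard error, while the formula table's σp would use p0 (giving .03125 instead of .03207).

```python
import math
from scipy import stats

x, n, p0 = 71, 215, 0.30
phat = x / n                                  # .33023

# Mirrors the answer above, which uses phat in the standard error.
sp = math.sqrt(phat * (1 - phat) / n)         # .03207
z = (phat - p0) / sp                          # about 0.94
p_value = 1 - stats.norm.cdf(z)               # one-sided p-value, about .17
print(z, p_value)   # p above alpha = .05, so do not reject H0
```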
5. A fast food corporation wishes to predict its mean weekly sales as a function of weekly traffic flow on the street where the restaurant is and the city in which it is located. In the first version of the study, the data is as below. y is 'sales' in thousands, x1 is 'flow', traffic flow in thousands of cars per week, and x2 is 1 if the store is in city 2, zero otherwise. (Use α = .01.)
Row   y     x1     x2
1     6.4   59.3   0
2     6.7   60.3   0
3     7.7   82.1   0
4     2.9   32.3   0
5     9.5   98.0   0
6     6.0   54.1   0
7     6.2   54.4   0
8     5.0   51.4   0
9     3.5   36.7   0
10    8.4   75.9   1
11    5.2   48.4   1
12    3.9   41.5   1
13    5.5   52.6   1
14    4.1   41.1   1
15    3.2   29.6   1
16    5.4   49.5   1
The following data is computed for you: Σy = 89.6000, Σx1 = 867.200, Σx2 = 7.0, n = 16, Σx1² = 52023.3, Σx2² = ?, Σy² = 554.760, Σx1y = 5358.62, Σx2y = ?, Σx1x2 = 338.60. You do not need all of these on this page.
a. Compute a simple regression of sales against flow. (7)
b. Given your equation, what sales do you expect when the flow is 60.00? (1)
c. Compute R². (4)
d. Compute se. (3)
e. Compute sb1 (the standard deviation of the slope) and do a significance test for β1. (3)
f. Do a prediction interval for sales when the flow is 60. (3)
Solution: a) Spare Parts Computation:

x̄1 = Σx1/n = 867.2/16 = 54.2      ȳ = Σy/n = 89.6/16 = 5.6
SSx1 = Σx1² - nx̄1² = 52023.3 - 16(54.2)² = 5021.07
Sx1y = Σx1y - nx̄1ȳ = 5358.62 - 16(54.2)(5.6) = 502.3
SSy = Σy² - nȳ² = 554.76 - 16(5.6)² = 53.00

Note that SSx1 and SSy cannot be negative!

b1 = Sx1y/SSx1 = 502.3/5021.07 = 0.1000
b0 = ȳ - b1x̄1 = 5.6 - 0.1000(54.2) = 0.1800

Ŷ = b0 + b1x1 becomes Ŷ = 0.18 + 0.100x1.
b) If x1 = 60, Ŷ = 0.18 + 0.100(60) = 6.18.
c) SSR = b1 Sx1y = 0.100(502.3) = 50.23, so R² = SSR/SST = 50.23/53.00 = 0.948; or

R² = (Sx1y)²/(SSx1 · SSy) = (502.3)²/[(5021.07)(53.00)] = 0.9481.

(0 ≤ R² ≤ 1 always!)
d) SSE = SST - SSR = 53.00 - 50.23 = 2.77, so se² = SSE/(n - 2) = 2.77/14 = 0.1979 and se = √0.1979 = 0.4448. (se² is always positive!)

e) sb1² = se²/SSx1 = 0.1979/5021.07 = 0.00003941, so sb1 = √.00003941 = .00628 and tb1 = b1/sb1 = 0.1000/.00628 = 15.92. The usual significance test is H0: β1 = 0 against H1: β1 ≠ 0. Make a diagram. We accept the null hypothesis if our t ratio is between -t.005(14) = -2.977 and t.005(14) = 2.977. Since 15.92 is not between these numbers, reject the null hypothesis and say that the slope is significant. Note that the same people who found sb1 instead of sb0 on the last exam found sb0 instead of sb1 on this one, proving that it is easier to copy than to think.
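SciPy's linregress reproduces the slope, intercept, R² and sb1:

```python
from scipy import stats

flow = [59.3, 60.3, 82.1, 32.3, 98.0, 54.1, 54.4, 51.4,
        36.7, 75.9, 48.4, 41.5, 52.6, 41.1, 29.6, 49.5]
sales = [6.4, 6.7, 7.7, 2.9, 9.5, 6.0, 6.2, 5.0,
         3.5, 8.4, 5.2, 3.9, 5.5, 4.1, 3.2, 5.4]

fit = stats.linregress(flow, sales)
print(fit.intercept, fit.slope)      # about 0.18 and 0.100
print(fit.rvalue**2, fit.stderr)     # R-squared about .948; sb1 about .0063
```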
f) We have already found that if x1 = 60, Ŷ0 = 0.18 + 0.100(60) = 6.18. From the regression formula outline the Prediction Interval is Y0 = Ŷ0 ± t sY, where

sY² = se²[1/n + (x1 - x̄1)²/SSx1 + 1] = 0.1979[1/16 + (60 - 54.2)²/5021.07 + 1] = 0.1979[0.0625 + 33.64/5021.07 + 1] = 0.1979(1.069) = 0.2116.

So sY = √0.2116 = 0.4600 and Y0 = Ŷ0 ± t sY = 6.18 ± 2.977(0.4600) = 6.18 ± 1.37. Note that the same people who found a prediction interval instead of a confidence interval on the last exam found a confidence interval instead of a prediction interval on this one, proving that it is easier to copy than to think.
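And the prediction interval itself, computed from the spare parts above:

```python
import math
from scipy import stats

n, x0, xbar, SSx1, se2 = 16, 60.0, 54.2, 5021.07, 0.1979
yhat = 0.18 + 0.100 * x0                                   # 6.18

sy = math.sqrt(se2 * (1 + 1/n + (x0 - xbar)**2 / SSx1))    # about 0.46
t = stats.t.ppf(1 - 0.005, n - 2)                          # t(14, .005) = 2.977
print(yhat - t * sy, yhat + t * sy)                        # roughly 6.18 +/- 1.37
```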
6. Data from the previous problem is repeated below. y is 'sales' in thousands, x1 is 'flow', traffic flow in thousands of cars per week, and x2 is 1 if the store is in city 2, zero otherwise. (Use α = .01.)
Row   y     x1     x2
1     6.4   59.3   0
2     6.7   60.3   0
3     7.7   82.1   0
4     2.9   32.3   0
5     9.5   98.0   0
6     6.0   54.1   0
7     6.2   54.4   0
8     5.0   51.4   0
9     3.5   36.7   0
10    8.4   75.9   1
11    5.2   48.4   1
12    3.9   41.5   1
13    5.5   52.6   1
14    4.1   41.1   1
15    3.2   29.6   1
16    5.4   49.5   1
The following data is computed for you: Σy = 89.6000, Σx1 = 867.200, Σx2 = 7.0, n = 16, Σx1² = 52023.3, Σx2² = ?, Σy² = 554.760, Σx1y = 5358.62, Σx2y = ?, Σx1x2 = 338.60.
a. Do a multiple regression of sales against x1 and x2. (12)
b. Compute R² and R² adjusted for degrees of freedom for both this and the previous problem. Compare the values of R² adjusted between this and the previous problem. Use an F test to compare R² here with the R² from the previous problem. What does your F-test suggest about the significance of the coefficient of x2? (5)
c. Compute the regression sum of squares and use it in an F test to test the usefulness of this regression. (5)
d. Use your regression to predict sales in city 2 when flow is 60.00. (2)
e. Use the directions in the outline to make this estimate into a confidence interval and a prediction interval. (4)
Solution: You should be able to compute all the sums below. The only ones that you were asked to compute here are Σx2², which was identical to Σx2, and Σx2y, which was mostly zeroes.

Row   y      x1      x2    x1²        x2²    y²       x1y       x2y    x1x2
1     6.4    59.3    0     3516.49    0      40.96    379.52    0.0    0.0
2     6.7    60.3    0     3636.09    0      44.89    404.01    0.0    0.0
3     7.7    82.1    0     6740.41    0      59.29    632.17    0.0    0.0
4     2.9    32.3    0     1043.29    0      8.41     93.67     0.0    0.0
5     9.5    98.0    0     9604.00    0      90.25    931.00    0.0    0.0
6     6.0    54.1    0     2926.81    0      36.00    324.60    0.0    0.0
7     6.2    54.4    0     2959.36    0      38.44    337.28    0.0    0.0
8     5.0    51.4    0     2641.96    0      25.00    257.00    0.0    0.0
9     3.5    36.7    0     1346.89    0      12.25    128.45    0.0    0.0
10    8.4    75.9    1     5760.81    1      70.56    637.56    8.4    75.9
11    5.2    48.4    1     2342.56    1      27.04    251.68    5.2    48.4
12    3.9    41.5    1     1722.25    1      15.21    161.85    3.9    41.5
13    5.5    52.6    1     2766.76    1      30.25    289.30    5.5    52.6
14    4.1    41.1    1     1689.21    1      16.81    168.51    4.1    41.1
15    3.2    29.6    1     876.16     1      10.24    94.72     3.2    29.6
16    5.4    49.5    1     2450.25    1      29.16    267.30    5.4    49.5
Sum   89.6   867.2   7     52023.3    7      554.76   5358.62   35.7   338.6

So Σy = 89.6000, Σx1 = 867.200, Σx2 = 7.0, n = 16, Σx1² = 52023.3, Σx2² = 7, Σy² = 554.760, Σx1y = 5358.62, Σx2y = 35.7, Σx1x2 = 338.60. Of course, many of you decided that, since Σx2 = 7, Σx2² = 49 - after a whole year of statistics, too.
a) First, we compute or copy from the last problem ȳ = Σy/n = 89.6/16 = 5.60, x̄1 = Σx1/n = 867.2/16 = 54.2, and x̄2 = Σx2/n = 7/16 = 0.4375. Then, we compute or copy our spare parts:

SSy = Σy² - nȳ² = 554.76 - 16(5.6)² = 53.000 *
Sx1y = Σx1y - nx̄1ȳ = 5358.62 - 16(54.2)(5.6) = 502.30
Sx2y = Σx2y - nx̄2ȳ = 35.7 - 16(0.4375)(5.6) = -3.50
SSx1 = Σx1² - nx̄1² = 52023.3 - 16(54.2)² = 5021.07 *
SSx2 = Σx2² - nx̄2² = 7 - 16(0.4375)² = 3.9375
Sx1x2 = Σx1x2 - nx̄1x̄2 = 338.60 - 16(54.2)(0.4375) = -40.8

* indicates quantities that must be positive. (Note that some of these were computed for the last problem. Can you believe that some people copied the '*' from last year's exam?)

Then we substitute these numbers into the Simplified Normal Equations:

Sx1y = b1 SSx1 + b2 Sx1x2
Sx2y = b1 Sx1x2 + b2 SSx2

which are

502.3 = 5021.07 b1 - 40.8 b2
-3.50 = -40.8 b1 + 3.9375 b2

and solve them as two equations in two unknowns for b1 and b2. These are a fairly tough pair of equations to solve. The choices are, essentially, to multiply the second equation by 10.3619 = 40.8/3.9375 to eliminate b2, or to multiply it by 123.06544 = 5021.07/40.8 to eliminate b1. Let's try the first. The equations become

502.3 = 5021.07 b1 - 40.8 b2
-36.267 = -422.766 b1 + 40.8 b2

If we add these together, we get 466.033 = 4598.304 b1. This means that b1 = 466.033/4598.304 = 0.10135. The first of the two normal equations can now have our new value substituted into it to get 502.3 = 5021.07(0.10135) - 40.8 b2, or -6.5854 = -40.8 b2. If we solve this for b2, we get b2 = 0.1614. Finally we get b0 by solving b0 = ȳ - b1x̄1 - b2x̄2 = 5.6 - 0.10135(54.2) - 0.1614(0.4375) = 0.0362. Thus our equation is Ŷ = b0 + b1X1 + b2X2 = 0.0362 + 0.10135X1 + 0.1614X2.
b) In the previous problem we had SSR = b1 Sx1y = 0.100(502.3) = 50.23 and R² = SSR/SST = 50.23/53.00 = 0.948.

In this problem SSR = b1 Sx1y + b2 Sx2y = 0.10135(502.3) + 0.1614(-3.5) = 50.34 and R² = SSR/SST = 50.34/53.00 = 0.950. If we use R̄², which is R² adjusted for degrees of freedom, we get, for the first regression

R̄² = [(n - 1)R² - k]/(n - k - 1) = [15(0.948) - 1]/14 = .944,

and for the second

R̄² = [15(0.950) - 2]/13 = .942.

This is evidence that the second independent variable didn't help.
A better way of doing this is to look at (from the outline)

F(r, n-k-r-1) = [(R²(k+r) - R²(k))/r] / [(1 - R²(k+r))/(n - k - r - 1)], where k = 1, r = 1 and n is still 16.

F(1,13) = [(.950 - .948)/1] / [(1 - .950)/13] = 0.52. If we check the F table, F.01(1,13) = 9.07. Our null hypothesis is essentially that x2 doesn't help, and we cannot reject it.
c) The same thing can be done using ANOVA. The ANOVA table is, for the first regression:

Source   SS      DF    MS      F       F.01
X1       50.23   1     50.23   255 s   8.86
Error    2.77    14    0.197
Total    53.00   15

The ANOVA table is, for the second regression:

Source    SS      DF    MS      F       F.01
X1, X2    50.34   2     25.16   123 s   6.70
Error     2.66    13    0.205
Total     53.00   15

Both of these show that the Xs have a significant relationship to Y. However, when we combine them, the story is not as positive:

Source   SS      DF    MS      F        F.01
X1       50.23   1     50.23   245 s    9.07
X2       0.11    1     0.11    0.53 ns  9.07
Error    2.66    13    0.205
Total    53.00   15

Since our computed F is smaller than the table F, we do not reject our null hypothesis that X2 has no effect.
d) Our regression is Ŷ = b0 + b1X1 + b2X2 = 0.0362 + 0.10135X1 + 0.1614X2. If X1 = 60 and X2 = 1, we have Ŷ0 = 0.0362 + 0.10135(60) + 0.1614(1) = 6.2786.

e) From the ANOVA table, se = √0.205 = 0.4528. Since k = 2, tα/2(n-k-1) = t.005(13) = 3.012. The outline says that an approximate confidence interval is μY0 = Ŷ0 ± t se/√n = 6.28 ± 3.012(0.4528)/√16 = 6.28 ± 0.34, and an approximate prediction interval is Y0 = Ŷ0 ± t se = 6.28 ± 3.012(0.4528) = 6.28 ± 1.36.
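A numpy least-squares check of the coefficients found by hand above:

```python
import numpy as np

x1 = np.array([59.3, 60.3, 82.1, 32.3, 98.0, 54.1, 54.4, 51.4,
               36.7, 75.9, 48.4, 41.5, 52.6, 41.1, 29.6, 49.5])
x2 = np.array([0]*9 + [1]*7, float)
y = np.array([6.4, 6.7, 7.7, 2.9, 9.5, 6.0, 6.2, 5.0,
              3.5, 8.4, 5.2, 3.9, 5.5, 4.1, 3.2, 5.4])

X = np.column_stack([np.ones_like(y), x1, x2])    # design matrix with a constant
b, *_ = np.linalg.lstsq(X, y, rcond=None)         # least squares solution
print(b)   # about [0.0362, 0.10135, 0.1614]
```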
7. The regression in the previous problem was run again, using data from four cities. Remember, y is 'sales' in thousands and x1 is 'flow', traffic flow in thousands of cars per week. (Use α = .05.)
First it was run in the form Y = b0 + b1X1 with the following results.

The regression equation is
sales = 0.010 + 0.109 flow

Predictor   Coef       Stdev      t-ratio   p
Constant    0.0104     0.3583     0.03      0.977
flow        0.108570   0.006077   17.87     0.000

s = 0.5947     R-sq = 93.6%     R-sq(adj) = 93.3%

Analysis of Variance
SOURCE       DF    SS       MS       F        p
Regression   1     112.87   112.87   319.17   0.000
Error        22    7.78     0.35
Total        23    120.65

Then it was run again in the form Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 with the following results:

The regression equation is
sales = - 0.178 + 0.105 flow + 0.199 city2 + 0.675 city3 + 1.17 city4

Predictor   Coef       Stdev      t-ratio   p
Constant    -0.1782    0.2941     -0.61     0.552
flow        0.105002   0.004475   23.47     0.000
city2       0.1991     0.2049     0.97      0.343
city3       0.6751     0.2745     2.46      0.024
city4       1.1717     0.2245     5.22      0.000

s = 0.3960     R-sq = 97.5%     R-sq(adj) = 97.0%

Analysis of Variance
SOURCE       DF    SS        MS       F        p
Regression   4     117.674   29.418   187.61   0.000
Error        19    2.979     0.157
Total        23    120.653

SOURCE   DF   SEQ SS
flow     1    112.873
city2    1    0.274
city3    1    0.254
city4    1    4.272
a) What does the ANOVA show? (2)
b) Do an F test to show whether location (adding x2 'city2', x3 'city3' and x4 'city4' all at once) improves our explanation of weekly sales. (4)
c) We have added dummy variables for cities 2, 3 and 4. Why didn't we add one for city 1? (1)
d) What sales are predicted for a flow of 60 in city 3? What does it mean to say that the coefficient of 'city3' is .6751? (2)
e) Explain how the model would be modified to show interaction between city and traffic flow. (2)
f) An ANOVA was run to determine if management style affected the number of sick days taken by employees. The research was done using 3 different management styles in five separate departments. The dependent variable was the number of sick days taken by each employee. The Minitab output follows (Pelosi and Sandifer):
Source        DF    SS        MS
Department    4     208.187   52.047
Mgt. Style    2     101.440   50.720
Interaction   8     44.293    5.537
Error         60    42.000    0.700
Total         74    395.920
Solution: a) The ANOVAs both show the same thing. You can test these using values from the F table if
you wish, but it is enough to say that the p-values of zero should lead us to reject the null hypothesis that
there is no relationship between Y and the Xs.
b) The easiest way is to use the F test

F(r, n-k-r-1) = [(R²(k+r) - R²(k))/r] / [(1 - R²(k+r))/(n - k - r - 1)].

Since the total degrees of freedom are n - 1 = 23, n = 24. r = 3 is the number of variables added. k = 1 is the number of independent variables we started with. The original R² was .936 and it grew to .975. This time R-squared adjusted grew, which is a good sign. F(3,19) = [(.975 - .936)/3] / [(1 - .975)/19] = 9.88. The table says that F.05(3,19) = 3.13, so taken as a whole the dummy variables seem to be beneficial.
c) You cannot add a variable that is a linear combination of others. If x5 were added to represent 'city 1', it would be equal to 1 - x2 - x3 - x4 and would make computation impossible.
d) In city 3, sales = - 0.178 + 0.105 flow + 0.199 city2 + 0.675 city3 + 1.17 city4 = - 0.178 + 0.105(60) + 0.199(0) + 0.675(1) + 1.17(0) = 6.797. The .6751 coefficient tells us that if we compare locations with the same traffic in cities 1 and 3, the location in city 3 will have sales .6751 higher.
e) We could modify the model for interaction by adding 3 new variables: X5 = X1X2, X6 = X1X3 and X7 = X1X4.
f) Finish the Minitab table and explain what it shows. In particular, citing numbers in the table or from the F
table, does management style make a difference in the number of sick days that employees take and does
what department management style is changed in seem to have an effect? (5)
Source        DF    SS        MS       F          F.05
Department    4     208.187   52.047   74.353 s   F.05(4,60) = 2.53
Mgt. Style    2     101.440   50.720   72.531 s   F.05(2,60) = 3.15
Interaction   8     44.293    5.537    7.910 s    F.05(8,60) = 2.10
Error         60    42.000    0.700
Total         74    395.920
This is your basic Minitab printout for 2-way ANOVA. To finish it divide the MSs by the Error (Within)
mean square (0.700) to get the values of F, look up the corresponding values of F on the table and declare
those lines that have Fs that are larger than the table F to imply a rejection of a null hypothesis. In this case
all the F's that we computed are larger than the table Fs, so we conclude (i) that department affects the
number of sick days, (ii) that management style affects the number of sick days and (iii) that changes of
management style have different effects in different departments.
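A sketch that finishes the table programmatically, using scipy.stats.f.ppf for the table values:

```python
from scipy import stats

mse = 0.700   # error (within) mean square, 60 df
rows = [("Department", 52.047, 4), ("Mgt. Style", 50.720, 2), ("Interaction", 5.537, 8)]

for name, ms, df in rows:
    F = ms / mse                       # computed F for this line
    crit = stats.f.ppf(0.95, df, 60)   # 5% table value
    print(name, round(F, 2), round(crit, 2), F > crit)   # all True -> reject each H0
```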
8. Extra Credit - Questions on correlation.
Go back to problem 7. Use the R-sq in the first regression to find the correlation between sales and traffic flow. Use the same significance level that you used on that problem.
a. Test the correlation between sales and flow for significance. (3)
b. Test the hypothesis that the correlation between sales and traffic flow is .9. (4)
c. Compute the partial correlation between sales and 'city4', rY4.123. (2)
d. It's no secret that not all the coefficients of the second regression in problem 7 were very
significant. I checked for (multi)collinearity by doing the following Minitab command:
MTB > corr c2 c3 c4 c5

Correlations (Pearson)

         flow     city2    city3
city2   -0.228
city3   -0.256   -0.243
city4    0.313   -0.329   -0.194
These results were also printed out as:

Matrix CORR1

          flow       city2      city3      city4
flow      1.00000   -0.22820   -0.25624    0.31345
city2    -0.22820    1.00000   -0.24254   -0.32918
city3    -0.25624   -0.24254    1.00000   -0.19389
city4     0.31345   -0.32918   -0.19389    1.00000
Explain what collinearity is and whether it is likely that collinearity influenced my results. (3)
e. Aczel reports the following regression results:
MTB > REGRESS 'export' on 4 'm1' 'lend' 'price' 'exch';
SUBC > DW.
………… (Most of output omitted)
Durbin-Watson statistic = 2.58
If n = 67, explain, telling your significance level, what we ought to conclude from this printout. (3)
Solution: a) Since R² = .936, r = √.936 = .9675. If we want to test H0: ρxy = 0 against H1: ρxy ≠ 0 and x and y are Normally distributed, we use

t(n-2) = r/sr = r/√[(1 - r²)/(n - 2)] = .9675/√[(1 - .936)/(24 - 2)] = 17.9458.

Compare this with ±t.025(22) = ±2.074. Since 17.9458 does not lie between these two values, reject the null hypothesis.
b) H0: ρxy = .9. If we are testing H0: ρxy = ρ0 against H1: ρxy ≠ ρ0, and ρ0 ≠ 0, we use Fisher's z-transformation. Let

z̃ = (1/2) ln[(1 + r)/(1 - r)] = (1/2) ln[(1 + .9675)/(1 - .9675)] = (1/2) ln(60.538) = 2.05164.

This has an approximate mean of

μz = (1/2) ln[(1 + ρ0)/(1 - ρ0)] = (1/2) ln[(1 + .9)/(1 - .9)] = (1/2) ln(19) = 1.47222

and a standard deviation of sz = √[1/(n - 3)] = √(1/21) = 0.218218, so that

t = (z̃ - μz)/sz = (2.05164 - 1.47222)/0.218218 = 2.655.

Compare this with ±t.025(22) = ±2.074. Since 2.655 does not lie between these two values, reject the null hypothesis.
c) The example given in the outline is from the computer printout:

rY3.12² = t3²/(t3² + df), where df = n - k - 1 and k is the number of independent variables.

The printout says:

The regression equation is
sales = - 0.178 + 0.105 flow + 0.199 city2 + 0.675 city3 + 1.17 city4

Predictor   Coef       Stdev      t-ratio   p
Constant    -0.1782    0.2941     -0.61     0.552
flow        0.105002   0.004475   23.47     0.000
city2       0.1991     0.2049     0.97      0.343
city3       0.6751     0.2745     2.46      0.024
city4       1.1717     0.2245     5.22      0.000

So rY4.123² = 5.22²/(5.22² + 19) = .589.
d) Collinearity is a condition that occurs when highly correlated independent variables appear in a regression. The result is large standard deviations and low values of t. Though we do have some insignificant coefficients, the cause is unlikely to be collinearity, since the correlations are not high.
e) This is a Durbin-Watson test and we are given the Durbin-Watson statistic, DW = 2.58. Use a Durbin-Watson table with n = 67 and k = 4 to fill in the diagram below:

0 to dL: ρ > 0 (positive autocorrelation)
dL to dU: ?
dU to 4 - dU (around 2): ρ = 0
4 - dU to 4 - dL: ?
4 - dL to 4: ρ < 0 (negative autocorrelation)

If you used the 5% table, you got dL = 1.48 and dU = 1.73, so 4 - dU = 2.27 and 4 - dL = 2.52; DW = 2.58 falls above 4 - dL, indicating negative autocorrelation. If you used the 1% table, you got dL = 1.32 and dU = 1.57, so 4 - dU = 2.43 and 4 - dL = 2.68; DW = 2.58 falls in the inconclusive zone between them.
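For completeness, the Durbin-Watson statistic is simple to compute from residuals (statsmodels also provides a durbin_watson function); here is a tiny sketch with toy residuals showing which end of the 0-to-4 scale signals which kind of autocorrelation.

```python
import numpy as np

def durbin_watson(residuals):
    """DW = sum of squared successive differences over sum of squared residuals."""
    e = np.asarray(residuals, float)
    return (np.diff(e) ** 2).sum() / (e ** 2).sum()

# Toy residuals: runs of like signs push DW toward 0 (positive autocorrelation);
# alternating signs push DW toward 4 (negative autocorrelation).
print(durbin_watson([1, 1, 1, -1, -1, -1]))    # well below 2
print(durbin_watson([1, -1, 1, -1, 1, -1]))    # well above 2
```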