4/16/03 252y0331 ECO252 QBA2 Name KEY

ECO252 QBA2
THIRD HOUR EXAM
April 21 - 22, 2003
Name KEY
Hour of Class Registered (Circle)
I. (30+ points) Do all the following (2 points each unless noted otherwise).
1.
Which of the following components in an ANOVA table are not additive?
a) Sum of squares.
b) Degrees of freedom.
c) *Mean squares.
d) It is not possible to tell.
TABLE 11-1
Psychologists have found that people are generally reluctant to transmit bad news to
their peers. This phenomenon has been termed the “MUM effect.” To investigate the
cause of the MUM effect, 40 undergraduates at Duke University participated in an
experiment. Each subject was asked to administer an IQ test to another student and
then provide the test taker with his or her percentile score. Unknown to the subject,
the test taker was a bogus student who was working with the researchers. The
experimenters manipulated two factors: subject visibility and success of test taker,
each at two levels. Subject visibility was either visible or not visible to the test
taker. Success of the test taker was either top 20% or bottom 20%. Ten subjects were randomly
assigned to each of the 2 x 2 = 4 experimental conditions, then the time (in seconds)
between the end of the test and the delivery of the percentile score from the subject
to the test taker was measured. (This variable is called the latency to feedback.) The
data were subjected to appropriate analyses with the following results.
Source               df   SS         MS        F       p-value
Subject visibility    1    1380.24    1380.24    4.26   0.043
Test taker success    1    1325.16    1325.16    4.09   0.050
Interaction           1    3385.80    3385.80   10.45   0.002
Error                36   11664.00     324.00
Total                39   17755.20
2.
Referring to Table 11-1, at the 0.01 level, what conclusions can you draw from the analyses?
a) At the 0.01 level, subject visibility and test taker success are significant predictors of
latency feedback.
b) At the 0.01 level, the model is not useful for predicting latency to feedback.
c) *At the 0.01 level, there is evidence to indicate that subject visibility and test taker
success interact.
d) At the 0.01 level, there is no evidence of interaction between subject visibility and test
taker success.
Explanation: I have no idea what all this means, but a look at the table p-values shows that
interaction is significant and the others aren’t. We reject the null hypothesis of no significant
interaction because the p-value is below 1%.
TABLE 11-2
An airline wants to select a computer software package for its reservation system.
Four software packages (1, 2, 3, and 4) are commercially available. The airline will
choose the package that bumps as few passengers, on the average, as possible during a
month. An experiment is set up in which each package is used to make reservations for
5 randomly selected weeks. (A total of 20 weeks was included in the experiment.) The
number of passengers bumped each week is obtained, which gives rise to the following
Excel output:
ANOVA
Source of Variation    SS       df    MS       F          P-value    F crit
Between Groups         212.4     3             8.304985   0.001474   3.238867
Within Groups          136.4           8.525
Total                  348.8
3.
Referring to Table 11-2, the total degrees of freedom is
a) 3
b) 4
c) 16
d) *19
Explanation: The total of 20 weeks seems to mean 20 observations, and 20 – 1 = 19. Since I
wasn’t sure that this was right, I checked that 8.525 = 136.4/16. So the df column has 3 + 16 = 19.
4.
Referring to Table 11-2, the between group mean squares is
a) 8.525
b) *70.8
c) 212.4
d) 637.2
Explanation: 212.4/3 = 70.8.
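The blank cells of the Excel table can be checked with a few lines of arithmetic. This sketch is not part of the original key; it just reproduces the numbers given in Table 11-2:

```python
# Filling in the blanks of the one-way ANOVA table in Table 11-2
# (4 packages, 5 weeks each, so 20 observations in all).
ss_between, ss_within = 212.4, 136.4
k, n = 4, 20                          # number of groups, total observations
df_between, df_within = k - 1, n - k  # 3 and 16
ms_between = ss_between / df_between  # 212.4 / 3 = 70.8
ms_within = ss_within / df_within     # 136.4 / 16 = 8.525
f_stat = ms_between / ms_within       # about 8.305, matching the printout
print(df_between + df_within, ms_between, ms_within)
```

The ratio of the two mean squares reproduces the printed F of 8.304985, which confirms the df split of 3 and 16.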
5.
Referring to Table 11-2, at a significance level of 1%,
a) there is insufficient evidence to conclude that the average numbers of customers bumped
by the 4 packages are not all the same.
b) there is insufficient evidence to conclude that the average numbers of customers bumped
by the 4 packages are all the same.
c) *there is sufficient evidence to conclude that the average numbers of customers bumped
by the 4 packages are not all the same.
d) there is sufficient evidence to conclude that the average numbers of customers bumped by
the 4 packages are all the same.
Explanation: The p-value of .001474 is below .01 so we reject the null hypothesis of no
difference between means.
6.
The Journal of Business Venturing reported on the activities of entrepreneurs during the
organization creation process. As part of a designed study, a total of 71 entrepreneurs were
interviewed and divided into 3 groups: those that were successful in founding a new firm (n1 =
34), those still actively trying to establish a firm (n2 = 21), and those who tried to start a new firm
but eventually gave up (n3 = 16). The total number of activities undertaken (e.g., developed a
business plan, sought funding, looked for facilities) by each group over a specified time period
during organization creation was measured. The objective is to compare the mean or median
number of activities of the 3 groups of entrepreneurs. The underlying distribution is not known to
be Normal, nor is it likely that the columns have similar variances. Identify the method that would
be used to analyze the data.
a) Two-way ANOVA
b) Friedman Test for differences in medians.
c) *Kruskal-Wallis Rank Test for Differences in Medians
d) One-way ANOVA F test
Explanation: ANOVA is not appropriate because ANOVA requires a Normal parent population
and that the samples come from populations with equal variances. The Friedman test requires that
the data be cross-classified, but there is no evidence that it is, especially since the columns are
of unequal length.
7.
The Y-intercept (b0) represents the
a) *predicted value of Y when X = 0.
b) change in estimated average Y per unit change in X.
c) predicted value of Y.
d) variation around the sample regression line.
Explanation: The fitted equation to predict Y using X is Ŷ = b0 + b1X. If X = 0,
Ŷ = b0 + b1(0) = b0.
8.
The least squares method minimizes which of the following?
a) SSR
(Gets larger when SSE gets smaller.)
b) *SSE
c) SST
(This is basically the variance of our y data; we can't really control it.)
d) All of the above
Explanation: SSE is defined as the sum of the squares of the distances Y − Ŷ, where the
prediction is Ŷ = b0 + b1X and Y represents the actual value of the dependent variable. OLS is a
process by which we find a line that minimizes the sum of the squared vertical distances between
the points Ŷ on the line and the actual values of Y, as tabulated for every actual value of X.
TABLE 13-2
A company has the distribution rights to home video sales of previously released movies and it would like to be able to estimate the
number of units it can expect to sell. It has data on 30 movies giving the box office gross in (millions of) dollars (in column 1) and the
number of (thousands of) units of home videos that it sold (in column 2). The data are not shown here but you may want to know that
the largest box office gross on the list was $58.51 (millions) and the biggest seller sold 365.14 (thousand) units. Since they didn't know
what they were doing, they did 2 regressions, but only one is correct.
—————
4/16/2003 12:07:53 AM
————————————————————
Welcome to Minitab, press F1 for help.
MTB > Retrieve "C:\Documents and Settings\RBOVE\My Documents\Drive D\MINITAB\2x03317.MTW".
Retrieving worksheet from file: C:\Documents and Settings\RBOVE\My Documents\Drive
D\MINITAB\2x0331-7.MTW
# Worksheet was saved on Wed Apr 16 2003
Results for: 2x0331-7.MTW
Explanation: This is the wrong regression since the company
wants to predict units sold from box office gross.
MTB > regress c1 1 c2
Regression Analysis: Gross versus Units
The regression equation is
Gross = - 8.52 + 0.168 Units
Predictor    Coef       SE Coef    T       P
Constant    -8.519      3.308     -2.58    0.016
Units        0.16795    0.01941    8.65    0.000

S = 9.424    R-Sq = 72.8%    R-Sq(adj) = 71.8%

Analysis of Variance
Source           DF   SS       MS       F       P
Regression        1   6647.4   6647.4   74.85   0.000
Residual Error   28   2486.7     88.8
Total            29   9134.1

Unusual Observations
Obs   Units   Gross   Fit     SE Fit   Residual   St Resid
 27    365    45.55   52.81    4.60     -7.26     -0.88 X
 28    219    46.62   28.20    2.23     18.42      2.01R
 30    255    58.51   34.24    2.73     24.27      2.69R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
Explanation: This is the right regression.
MTB > regress c2 1 c1
Regression Analysis: Units versus Gross
The regression equation is
Units = 76.5 + 4.33 Gross
Predictor   Coef      SE Coef   T      P
Constant    76.54     11.83     6.47   0.000
Gross        4.3331    0.5008   8.65   0.000

S = 47.87    R-Sq = 72.8%    R-Sq(adj) = 71.8%

Analysis of Variance
Source           DF   SS       MS       F       P
Regression        1   171500   171500   74.85   0.000
Residual Error   28    64154     2291
Total            29   235654

Unusual Observations
Obs   Gross   Units    Fit      SE Fit   Residual   St Resid
 23   23.1    280.79   176.76    9.45    104.03      2.22R
 27   45.6    365.14   273.91   17.22     91.23      2.04R
 30   58.5    254.58   330.07   23.05    -75.49     -1.80 X
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
9.
Referring to Table 13-2, what is the number of units you would expect to sell of a movie that had
a box office gross of $25 million?
Solution: The equation reads Units = 76.5 + 4.33 Gross, or Ŷ = 76.5 + 4.33X, so if
we substitute 25 (million), we get Ŷ = 76.5 + 4.33(25) = 184.75 (thousand) units.
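As a quick arithmetic check (not part of the original solution):

```python
# Predicted home-video sales at a $25 million gross,
# from the correct fitted equation Units = 76.5 + 4.33 Gross.
gross = 25                    # millions of dollars
units = 76.5 + 4.33 * gross   # thousands of units
print(units)                  # 184.75
```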
10. Referring to Table 13-2, what percentage of the total variation in units sold is explained by box
office gross?
Solution: R² = 72.8%.
11. Referring to Table 13-2, interpret the p-value for testing whether β1 exceeds 0.
a) *There is sufficient evidence (at α = 0.01) to conclude that box office gross (X) is a
useful linear predictor of units sold (Y).
b) There is insufficient evidence (at α = 0.05) to conclude that box office gross (X) is a
useful linear predictor of units sold (Y).
c) Box office gross (X) is a poor predictor of units sold (Y).
d) For every $1 million increase in box office gross, we expect the number of units sold to
increase by 0.
12. Referring to Table 13-2, give a 95% confidence interval for β1 and interpret the interval.
Solution: df = n − 2 = 30 − 2 = 28. This can also be read from the ANOVA as the error degrees of
freedom. The formula for a confidence interval for the slope is β1 = b1 ± t(n−2) s_b1. From the
printout b1 = 4.331 and s_b1 = 0.5008. The t table says t.025(28) = 2.048, so
β1 = 4.331 ± 2.048(0.5008) = 4.331 ± 1.026, and we can say with 95% confidence that the slope lies
between 3.305 and 5.357.
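The interval arithmetic can be sketched directly; the t value 2.048 is taken from the t table rather than computed, and the unrounded printout slope is used here:

```python
# 95% CI for the slope: b1 +/- t(.025, 28) * s_b1, values from the Minitab printout.
b1, s_b1 = 4.3331, 0.5008
t_crit = 2.048                # t table value, 28 df, two-sided 95%
half_width = t_crit * s_b1    # about 1.026
lower, upper = b1 - half_width, b1 + half_width
print(round(lower, 3), round(upper, 3))
```

Using the unrounded slope 4.3331 gives roughly 3.31 to 5.36; the 3.305 to 5.357 above comes from the rounded b1 = 4.331.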
13. A hospital does a test of goodness of fit to see if arrivals per hour follow a Poisson distribution
with a mean of 2. The data are below. The f column has been copied from the Poisson table.
The O and the E columns both add to 480.
x     O     O/n        Fo        f         Fe        E         D = |Fo − Fe|
0      65   0.135417   0.13542   0.135335  0.13534    64.961   0.0000817
1     130   0.270833   0.40625   0.270671  0.40601   129.922   0.0002440
2     125   0.260417   0.66667   0.270671  0.67668   129.922   0.0100103
3      96   0.200000   0.86667   0.180447  0.85712    86.615   0.0095427
4      37   0.077083   0.94375   0.090224  0.94735    43.308   0.0035980
5      11   0.022917   0.96667   0.036089  0.98344    17.323   0.0167703
6       0   0.000000   0.96667   0.012030  0.99547     5.774   0.0288003
7       0   0.000000   0.96667   0.003437  0.99890     1.650   0.0322373
8       0   0.000000   0.96667   0.000859  0.99976     0.412   0.0330963
9+     16   0.033333   1.00000   0.000241  1.00000     0.116   0.0000040
a) What method is the hospital using to check goodness of fit? (1)
b) What is the critical value it uses if α = .10? (2)
c) Does it accept the null hypothesis H0: Poisson(2)? Why? (1)
Solution: a) This is the Kolmogorov-Smirnov (K-S) method for checking the null hypothesis that a
given distribution (in this case Poisson(2)) fits the data. b) According to the K-S table, the 10% critical
value for n = 480 is CV = 1.22/√n = 1.22/√480 = .05569. c) Since the largest difference is 0.0330963, and it is
less than .05569, we must accept the null hypothesis that the distribution is Poisson(2).
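The K-S decision can be sketched in a couple of lines (the maximum difference and n are taken from the table above):

```python
import math

# K-S decision at alpha = .10, using the large-sample critical value 1.22 / sqrt(n).
n = 480
max_D = 0.0330963           # largest |Fo - Fe| in the table
cv = 1.22 / math.sqrt(n)    # about .0557
print(max_D < cv)           # True, so we cannot reject H0: Poisson(2)
```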
14. Since the administrator mistrusts the results, the analysis is redone. The data are below.
x     O      E         E − O      (E − O)²    (E − O)²/E    O²/E
0      65    64.961    -0.03920     0.0015     0.00002       65.039
1     130   129.922    -0.07792     0.0061     0.00005      130.078
2     125   129.922     4.92208    24.2269     0.18647      120.264
3      96    86.615    -9.38544    88.0865     1.01699      106.402
4      37    43.308     6.30752    39.7848     0.91866       31.611
5      11    17.323     6.32272    39.9768     2.30777        6.985
6+     16     7.952    -8.04784    64.7677     8.14467       32.193
a) What is the value of the test statistic this time? (2)
b) What is the table value against which we test the test statistic? (1)
c) Do we accept or reject the null hypothesis this time? Why? (1)
d) Why are there three fewer rows this time? (1)
e) The first method is supposedly more powerful than the second method. Do these results illustrate
this fact? Why? (1)
Solution: a) χ² = Σ(E − O)²/E = Σ(O²/E) − n = 12.5727 or 12.5746 (n = 480). b) This is a chi-
squared test and the degrees of freedom are the number of rows minus 1, which gives us 6, so
χ².10(6) = 10.6446. c) Since the table value is below our computed chi-squared, we reject the null
hypothesis. d) We have to merge all rows with E below 5. e) In this large-sample case, it is not more
powerful. A more powerful method is more likely to reject the null hypothesis when it is false; the fact
that chi-squared rejected when K-S did not leads us to believe that chi-squared is more powerful.
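Both forms of the statistic can be checked from the merged table (a sketch using the O and E columns above):

```python
# Chi-squared goodness-of-fit statistic computed both ways from the merged
# 7-row table (rows with E below 5 combined into "6+").
O = [65, 130, 125, 96, 37, 11, 16]
E = [64.961, 129.922, 129.922, 86.615, 43.308, 17.323, 7.952]
n = sum(O)                                            # 480
chi_sq = sum((o - e) ** 2 / e for o, e in zip(O, E))
chi_sq_alt = sum(o * o / e for o, e in zip(O, E)) - n
print(round(chi_sq, 4), round(chi_sq_alt, 4))         # both about 12.57
# df = 7 - 1 = 6, and the 10% table value is 10.6446, so we reject H0.
```

The two versions differ only in the rounding of the tabled E values.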
15. Turn in your computer problems 2 and 3 marked as requested in the Take-home. (5 points, 2 point
penalty for not doing.)
ECO252 QBA2
Third EXAM
April 21, 22 2003
TAKE HOME SECTION
Name: _________________________
Social Security Number: _________________________
Please Note: computer problems 2 and 3 should be turned in with the exam. In problem 2, the 2 way
ANOVA table should be completed. The three F tests should be done with a 5% significance level and you
should note whether there was (i) a significant difference between drivers, (ii) a significant difference
between cars and (iii) significant interaction. In problem 3, you should show on your third graph where the
regression line is.
II. Do the following: (22+ points) assume a 5% significance level. Show your work!
1. The Lees, in their book on statistics for Finance majors, ask about the relationship of gasoline prices (y)
to crude oil prices (x) and present the following data for the years 1979 - 1988. (To get you started, the sum
of the crude price column is 211.16 and the sum of the numbers squared in the crude price column is
4936.30.)
Obs   Gas Price     Crude Price
No    (cents/gal)   (dollars/barrel)
 1      86            12.64
 2     119            21.59
 3     133            31.77
 4     122            28.52
 5     116            26.19
 6     113            25.88
 7     112            24.09
 8      86            12.51
 9      90            15.40
10      90            12.57
Just to make things interesting, change the tenth number in the Gas Price column by adding the 3rd digit of
your Social Security number to it. For example, Seymour Butz’s SS number is 123456789 and he will
change 90 to 93. This should not change the results by much.
Show your work – it is legitimate to check your results by running the problem on the computer, but I
expect to see hand computations for every part of this problem.
a. Compute the regression equation Ŷ = b0 + b1x to predict the price of gasoline on the basis of
crude oil prices. (3)
b. Compute R². (2)
c. Compute s_e. (2)
d. Compute s_b1 and do a significance test on b1. (2)
e. In 1978, the price of crude oil was 63 cents a gallon. (Note posted correction – price was $9.00
per barrel.) Using this, create a prediction interval for the price of gasoline for that year. Explain why a
confidence interval for the price is inappropriate. (3)
Solution: Working with the original data, we get the following table. Important: The solution that follows
is for the original data. All other solutions are sketched in 252y033app. Especially note the first version if
you added a decimal point to the gas price. Make sure that you were graded fairly. Much of the solution was
unchanged if you moved the decimal point.
Row    y (gaspr)   x (crprice)      xy         x²        y²
 1       86          12.64        1087.04     159.77     7396
 2      119          21.59        2569.21     466.13    14161
 3      133          31.77        4225.41    1009.33    17689
 4      122          28.52        3479.44     813.39    14884
 5      116          26.19        3038.04     685.92    13456
 6      113          25.88        2924.44     669.77    12769
 7      112          24.09        2698.08     580.33    12544
 8       86          12.51        1075.86     156.50     7396
 9       90          15.40        1386.00     237.16     8100
10       90          12.57        1131.30     158.00     8100
Sum    1067         211.16       23614.82    4936.30   116495

n = 10, Σx = 211.16, Σy = 1067, Σx² = 4936.30, Σxy = 23614.82 and Σy² = 116495.
Spare Parts Computation:
x̄ = Σx/n = 211.16/10 = 21.116        ȳ = Σy/n = 1067/10 = 106.7
SSx = Σx² − nx̄² = 4936.30 − 10(21.116)² = 477.4454
Sxy = Σxy − nx̄ȳ = 23614.82 − 10(21.116)(106.7) = 1084.048
SSy = Σy² − nȳ² = 116495 − 10(106.7)² = 2646.10 = SST
a) b1 = Sxy/SSx = 1084.048/477.4454 = 2.2705
b0 = ȳ − b1x̄ = 106.7 − 2.2705(21.116) = 58.7561
So Ŷ = b0 + b1x becomes Ŷ = 58.7561 + 2.2705x.
b) SSR = b1·Sxy = 2.2705(1084.048) = 2461.33, so R² = SSR/SST = 2461.33/2646.10 = .9302,
or R² = Sxy²/(SSx·SSy) = (1084.048)²/[(477.4454)(2646.10)] = .9302.
c) SSE = SST − SSR = 2646.10 − 2461.33 = 184.77, so s_e² = SSE/(n − 2) = 184.77/8 = 23.09625,
or s_e² = (SSy − b1·Sxy)/(n − 2) = [2646.10 − 2.2705(1084.048)]/8 = 23.0962,
or s_e² = (1 − R²)SST/(n − 2) = (1 − .9302)(2646.10)/8 = 23.0872,
or s_e² = (SSy − b1²·SSx)/(n − 2) = [2646.10 − (2.2705)²(477.4454)]/8 = 23.0985.
So s_e = √23.09625 = 4.8059. (s_e² is always positive!)
d) s_b1² = s_e²/SSx = 23.09625/477.4454 = 0.048375, so s_b1 = √0.048375 = 0.21994.
The outline says that to test H0: β1 = β10 against H1: β1 ≠ β10, use t = (b1 − β10)/s_b1. Remember that
β10 is most often zero, and if the null hypothesis is false in that case we say that β1 is significant. So
t = (b1 − β10)/s_b1 = 2.2705/0.21994 = 10.3233.
Make a diagram of an almost Normal curve with zero in the middle; if α = .05, the rejection zones
are above t.025(n−2) = t.025(8) = 2.306 and below −t.025(8) = −2.306. Since our computed t-ratio, at
10.3233, is well into the upper rejection zone, we reject the null hypothesis that the coefficient is zero
and say that b1 is significant.
Note that the F test shown next gives the same information. In fact, the two F's in the table are just the
squares of our t's.
Source           SS        DF   MS          F        F.05
Regression       2461.33    1   2461.33     106.57   F(1,8) = 5.32 s
Error (Within)    184.77    8     23.09625
Total            2646.10    9
e) Our equation says that Ŷ = 58.7561 + 2.2705x, so, if x0 = 9.00, Ŷ0 = 58.7561 + 2.2705(9.00) = 79.19.
The prediction interval is Y0 = Ŷ0 ± t·s_Y, where
s_Y² = s_e²[1/n + (X0 − X̄)²/(ΣX² − nX̄²) + 1] = 23.09625[1/10 + (9 − 21.116)²/477.4454 + 1] = 32.507,
so s_Y = √32.507 = 5.702,
and the prediction interval is Y0 = Ŷ0 ± t·s_Y = 79.19 ± 2.306(5.702) = 79.19 ± 13.15. (The actual price
of 86 cents was well within this range.)
This is the only appropriate interval because the confidence interval for Y gives an average value for many
years in which the oil price was $9.00, but it was only $9.00 in 1978 as far as we know.
-------------------------------------------------------------
Minitab results follow: Results for: 2x0331-9.MTW
Regression Analysis: y versus x
The regression equation is
y = 58.8 + 2.27 x
Predictor   Coef     SE Coef   T       P
Constant    58.756   4.887     12.02   0.000
x            2.2705  0.2199    10.32   0.000

S = 4.806    R-Sq = 93.0%    R-Sq(adj) = 92.1%
Analysis of Variance
Source           DF   SS       MS       F        P
Regression        1   2461.3   2461.3   106.57   0.000
Residual Error    8    184.8     23.1
Total             9   2646.1
Unusual Observations
Obs    x      y        Fit      SE Fit   Residual   St Resid
 2     21.6   119.00   107.78   1.52     11.22      2.46R

R denotes an observation with a large standardized residual
2. According to the Lees, the daily rate of return for a stock in percent is summarized in the following table.
Add the second to last digit of your Social Security number to the 50. For example, if Seymour Butz’s SS
number is 123456789, he will change the 50 to 58 and the total to 203.
x interval   z interval   O     O/n   F0   Fe   fe   E   D
below -3                   20
-3 to -2                   25
-2 to -1                   30
-1 to 0                    50
0 to 1                     40
1 to 2                     25
above 2                     5
Total                     195
From the data we find that x̄ = 0 and s = 1.6. On the basis of this, test to see if the data follow a Normal
distribution by a) a chi-squared test (5) and b) a Lilliefors test. (5)
Hint: To find the probability of being on a given interval, you need values of z. You must use the sample
mean and variance I gave you in place of μ and σ. Once you find the values of z you need, put the
probability in the fe column. (You will have to round the values of z to numbers like 1.25 to use the
Normal table – round cliff-hangers like 1.875 to 1.87.) I showed you in class how to do the fe column
using the Fe column, but in any case, for example, the item in the first row of the fe column is P(x ≤ −3)
and the second is P(−3 ≤ x ≤ −2). If you have fe, you should be able to get E and do a chi-squared
test, remembering that we lost degrees of freedom by using the data to estimate the mean and variance. You
will probably need to fill in the entire table to do the Lilliefors test. Explain why this has to be a Lilliefors
test rather than a K-S test.
Solution: a) I started by filling in Table 1 below. As in the two Normal distribution examples covered in
class, I computed t = (x − x̄)/s = (x − 0)/1.6 and called it z. For example,
P(−3 ≤ x ≤ −2) = P[(−3 − 0)/1.6 ≤ z ≤ (−2 − 0)/1.6] = P(−1.87 ≤ z ≤ −1.25)
= P(−1.87 ≤ z ≤ 0) − P(−1.25 ≤ z ≤ 0) = .4693 − .3944 = .0749.
This can go in the fe column to compute the expected number of items between -3 and -2. In order to
speed up computations, what I actually did was to compute the Fe column first. For example I got
F(−2) = P(x ≤ −2) = P[z ≤ (−2 − 0)/1.6] = P(z ≤ −1.25) = P(z ≤ 0) − P(−1.25 ≤ z ≤ 0)
= .5 − .3944 = .1056, which I needed as the second item in the Fe column. Then, as I demonstrated in
class, P(−3 ≤ x ≤ −2) = F(−2) − F(−3) = .1056 − .0307 = .0749. So I now had fe = .0749. This is the
probability or proportion of data in -3 to -2. Since there are 195 items, we want E = .0749(195) = 14.60
items in the second row of the E column. Using the O and E columns, we can do a chi-squared test as in
Table 2 below.
Table 1:
x interval   z interval        O    O/n     F0       Fe       fe       E       D
below -3     below -1.87      20   .1026   .1026    .0307    .0307     5.99   .0719
-3 to -2     -1.87 to -1.25   25   .1282   .2308    .1056    .0749    14.60   .1252
-2 to -1     -1.25 to -0.62   30   .1538   .3846    .2676    .1620    31.59   .1170
-1 to 0      -0.62 to 0       50   .2564   .6410    .5000    .2324    45.32   .1410
0 to 1       0 to 0.62        40   .2051   .8461    .7324    .2324    45.32   .1137
1 to 2       0.62 to 1.25     25   .1282   .9743    .8944    .1620    31.59   .0799
above 2      above 1.25        5   .0256  1.0000   1.0000    .1056    20.59   0
Total                        195                            1.0000   195.00
Table 2:
Row    O      E         E − O     (E − O)²    (E − O)²/E    O²/E
1      20      5.99    -14.01     196.280     32.7680        66.7780
2      25     14.60    -10.40     108.160      7.4082        42.8082
3      30     31.59      1.59       2.528      0.0800        28.4900
4      50     45.32     -4.68      21.902      0.4833        55.1633
5      40     45.32      5.32      28.302      0.6245        35.3045
6      25     31.59      6.59      43.428      1.3747        19.7847
7       5     20.59     15.59     243.048     11.8042         1.2142
      195    195.00      0.00                 54.5429       249.543
So, we can use either the last column or the second-to-last column to tell us that χ² = 249.543 − 195
= 54.543. Because we have estimated 2 parameters from the data, our degrees of freedom are 7 − 1 − 2
= 4, and χ².05(4) = 9.4877. Since the value of our computed test statistic at 54.543 is much larger than
9.4877, we must reject the null hypothesis that x is Normal.
b) This is much easier. The Lilliefors test is a single-purpose test to see if a distribution with unknown
population mean and variance is Normal. We cannot use K-S because we need to specify all parameters in
advance to use it. Go back to Table 1 and compute the D column by looking at the absolute values of the
differences between the F0 and Fe columns. You will find that the maximum D is .1410. According to
our Lilliefors table, the 5% critical value is .886/√n = .886/√195 = .0634. Since our maximum D is larger
than this, reject the null hypothesis.
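The Lilliefors comparison can be sketched from the F0 and Fe columns of Table 1:

```python
import math

# Lilliefors decision: max |F0 - Fe| against .886 / sqrt(n) at the 5% level.
F0 = [.1026, .2308, .3846, .6410, .8461, .9743, 1.0000]
Fe = [.0307, .1056, .2676, .5000, .7324, .8944, 1.0000]
max_D = max(abs(a - b) for a, b in zip(F0, Fe))  # .1410
cv = .886 / math.sqrt(195)                       # about .0634
print(max_D > cv)                                # True, so reject normality
```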
3) (Extra credit) The Lees present the following data. Actually I should have said that the numbers represent
student salaries, and the researcher wants to know if years of work experience make a difference.
          Years of Work Experience
Region     1     2     3
1         16    19    24
2         21    20    21
3         18    21    22
4         13    20    25
To vary the results, change the 25 by adding 1/10 of the third digit of your SS number. For example, if
Seymour Butz’s SS number is 123456789, he will change the 25 to 25.3.
a) Do a 2-way ANOVA on these data and explain what hypotheses you test and what the conclusions are. (6)
b) What other method could we use on these data to see if years of experience makes a difference while
allowing for cross-classification? Under what circumstances would we use it? Try it and tell what it tests
and what it shows.
Solution:
a) 2-way ANOVA (blocked by region). 's' indicates that the null hypothesis is rejected.

Region     x1      x2      x3    Sum xi.   ni   Mean x̄i.   SS (Σxij²)   x̄i.²
1          16.0    19.0    24.0    59.0     3    19.67        1193       386.78
2          21.0    20.0    21.0    62.0     3    20.67        1282       427.11
3          18.0    21.0    22.0    61.0     3    20.33        1249       413.44
4          13.0    20.0    25.0    58.0     3    19.33        1194       373.78
Sum        68.0   +80.0   +92.0  =240.0    12    20.00        4918      1601.11
nj           4      +4      +4    = 12
x̄.j       17.0    20.0    23.0    20.0 = x̄
SS        1190   +1602   +2126  =4918
x̄.j²      289    +400    +529  =1218
From the above, n = 12, Σxij = 240, Σxij² = 4918, Σx̄i.² = 1601.11, Σx̄.j² = 1218, and
x̄ = Σx/n = 240/12 = 20.
SST = Σxij² − nx̄² = 4918 − 12(20)² = 4918 − 4800 = 118.
SSC = Σ nj x̄.j² − nx̄² = 4(1218) − 12(20)² = 4872 − 4800 = 72. (This is SSB in a one-way ANOVA.)
SSR = Σ ni x̄i.² − nx̄² = 3(1601.11) − 12(20)² = 4803.33 − 4800 = 3.33.
(SSW = SST − SSC − SSR = 118 − 72 − 3.33 = 42.67)
Source                  SS       DF   MS      F       F.05               H0
Rows (Regions)            3.33    3    1.11   0.156   F(3,6) = 4.76 ns   Row means equal
Columns (Experience)     72.00    2   36.00   5.062   F(2,6) = 5.14 s    Column means equal
Within (Error)           42.67    6    7.112
Total                   118.00   11
So the results characterized by years of experience (column means) are significantly different.
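The sum-of-squares decomposition can be checked directly from the 4 x 3 table (a sketch using the original data, with the 25 unchanged):

```python
# Sum-of-squares decomposition for the 4 x 3 blocked layout (original data).
data = [[16, 19, 24],
        [21, 20, 21],
        [18, 21, 22],
        [13, 20, 25]]   # rows = regions, columns = years of experience
r, c = 4, 3
n = r * c
grand = sum(x for row in data for x in row) / n               # 20.0
sst = sum(x * x for row in data for x in row) - n * grand**2  # 118
row_means = [sum(row) / c for row in data]
col_means = [sum(data[i][j] for i in range(r)) / r for j in range(c)]
ssr = c * sum(m * m for m in row_means) - n * grand**2        # about 3.33
ssc = r * sum(m * m for m in col_means) - n * grand**2        # 72
ssw = sst - ssr - ssc                                         # about 42.67
ms_error = ssw / ((r - 1) * (c - 1))                          # about 7.11
f_rows = (ssr / (r - 1)) / ms_error                           # about 0.16
f_cols = (ssc / (c - 1)) / ms_error                           # about 5.06
print(round(ssr, 2), round(ssc, 2), round(ssw, 2), round(f_cols, 3))
```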
b) In general if the parent distribution is Normal use ANOVA, if it's not Normal, use Friedman or
Kruskal-Wallis. If the samples are independent random samples use 1-way ANOVA or Kruskal
Wallis. If they are cross-classified, use Friedman or 2-way ANOVA.
So the other method that allows for cross-classification is Friedman and we use it if the underlying
distribution is not Normal.
The null hypothesis is H0: columns come from the same distribution, or H0: η1 = η2 = η3 (equal medians).
We use a Friedman test because the data are cross-classified by region. This time we rank our data only
within rows. There are c = 3 columns and r = 4 rows.
          Original Data               Ranked Data
          Exper   Exper   Exper      Exper   Exper   Exper
            1       2       3          1       2       3
Region     x1      x2      x3         r1      r2      r3
1          16      19      24         1       2       3
2          21      20      21         2.5     1       2.5
3          18      21      22         1       3       2
4          13      20      25         1       2       3
                             SRi      5.5     8      10.5
To check the ranking, note that the sum of the three rank sums is 5.5 + 8 + 10.5 = 24, and that the
sum of the rank sums should be ΣSRi = rc(c + 1)/2 = 4(3)(4)/2 = 24.
Now compute the Friedman statistic
χF² = [12/(rc(c + 1))] ΣSRi² − 3r(c + 1) = [12/(4·3·4)](5.5² + 8² + 10.5²) − 3(4)(4)
= (1/4)(30.25 + 64 + 110.25) − 48 = 51.125 − 48 = 3.125.
If we check the Friedman table for c = 3 and r = 4, we find χF² = 2 has a p-value of .431 and χF² = 3.5
has a p-value of .273. Since our number lies between these, we can conclude that if α = .05, the p-value is
higher and we cannot reject the null hypothesis.
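The Friedman arithmetic can be verified from the rank sums given in the table above:

```python
# Friedman statistic from the within-row rank sums SR_i = 5.5, 8, 10.5.
r, c = 4, 3
rank_sums = [5.5, 8.0, 10.5]
# Ranking check: the rank sums must total rc(c + 1)/2 = 24.
assert sum(rank_sums) == r * c * (c + 1) / 2
chi_f = 12 / (r * c * (c + 1)) * sum(s * s for s in rank_sums) - 3 * r * (c + 1)
print(chi_f)   # 3.125
```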