4/26/02 252y0232 ECO252 QBA2 Name

(Page layout view!)
ECO252 QBA2
THIRD HOUR EXAM
April 18, 2002
I. (10+ points) Do all the following;
Name
KEY
Hour of Class Registered (Circle)
MWF TR 10 12 12:30 2:00
1. Hand in your computer printouts for problems 2 and 3. (5 points; 3-point penalty for not handing in.)
Remember that the ANOVA printout must be completed, using a 5% significance level, for full credit. I
should be able to tell what is tested and what the conclusions are.
2. a. In particular, is the interaction between car and driver significant? Which numbers made you think
that? (2)
b. Create two confidence intervals for the difference between the means for drivers 2 and 3, one that is
valid alone, and one that is valid simultaneously with other similar intervals. Do these intervals show a
significant difference between these two means? Why? (4)
Solution: The only parts of the solution to computer problem 2 that you need are:
Tabulated Statistics

ROWS: car     COLUMNS: driver

            1        2        3      ALL
  1    42.000   25.000   12.667   26.556
  2    32.000   28.000   29.333   29.778
  3    30.667   45.000   28.333   34.667
  4    31.333   24.667   54.667   36.889
ALL    34.000   30.667   31.250   31.972

CELL CONTENTS -- mpg:MEAN
MTB > twoway 'mpg''car''driver'
Two-way Analysis of Variance

Analysis of Variance for mpg
Source        DF       SS       MS
car            3    590.3    196.8
driver         2     76.1     38.0
Interaction    6   3227.9    538.0
Error         24    336.7     14.0
Total         35   4231.0
To complete the printout, divide each entry in the MS column by MSW = 14.0 and place the results in the
F column. Then look up the corresponding values of F in the 5% lines of the F table.

Source        DF       SS       MS          F    F.05                 H0
car            3    590.3    196.8    14.057s    F.05(3,24) = 3.01    Car means identical
driver         2     76.1     38.0     2.714ns   F.05(2,24) = 3.40    Driver means identical
Interaction    6   3227.9    538.0    38.428s    F.05(6,24) = 2.51    No interaction
Error         24    336.7     14.0
Total         35   4231.0
The first and the third null hypotheses are rejected. a) Since 38.428 is larger than 2.51, we reject the
hypothesis that there is no interaction and say that there is significant interaction.
b) Drivers 2 and 3 are in the columns. There are $R = 4$ rows, $C = 3$ columns and $P = 3$ measurements per
cell. Of course $RC(P-1) = 4(3)(2) = 24$, the number of degrees of freedom for 'within' or 'error.'
From the outline, we have for Bonferroni confidence intervals for column means
\[ \mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm t^{RC(P-1)}_{\alpha/(2m)} \sqrt{\frac{2\,MSW}{PR}} . \]
This becomes, for $m = 1$,
\[ \mu_2 - \mu_3 = (\bar{x}_2 - \bar{x}_3) \pm t^{24}_{.025} \sqrt{\frac{2\,MSW}{PR}} = (30.667 - 31.250) \pm 2.064 \sqrt{\frac{2(14.0)}{12}} = -0.583 \pm 2.064\sqrt{2.333} = -0.58 \pm 3.15 . \]
This interval includes zero, which indicates no significant difference.
For Scheffe intervals for column means use
\[ \mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm \sqrt{(C-1)\,F^{(C-1,\;RC(P-1))}_{\alpha}} \sqrt{\frac{2\,MSW}{PR}} . \]
So
\[ \mu_2 - \mu_3 = (30.667 - 31.250) \pm \sqrt{2\,F^{(2,24)}_{.05}} \sqrt{\frac{2(14)}{12}} = -.583 \pm \sqrt{2(3.40)}\sqrt{2.333} = -.583 \pm 3.98 . \]
This interval also includes zero, which indicates no significant difference.
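The two intervals for drivers 2 and 3 can be checked numerically. The following is a minimal sketch, not part of the original key: the variable names are illustrative assumptions, and the t and F values are simply the table values quoted in the solution rather than computed.

```python
import math

# Figures read from the Minitab printout above.
msw = 14.0                 # error (within) mean square
rows, cols, per_cell = 4, 3, 3   # R cars, C drivers, P observations per cell
mean_driver2, mean_driver3 = 30.667, 31.250

# Table values quoted in the solution (assumed, not computed here).
t_025_24 = 2.064           # t, 24 df, upper .025 point (Bonferroni with m = 1)
f_05_2_24 = 3.40           # F(2, 24), upper .05 point

diff = mean_driver2 - mean_driver3
std_err = math.sqrt(2 * msw / (per_cell * rows))

single_half = t_025_24 * std_err                            # valid alone
scheffe_half = math.sqrt((cols - 1) * f_05_2_24) * std_err  # valid simultaneously

for label, half in (("single", single_half), ("Scheffe", scheffe_half)):
    covers_zero = abs(diff) < half
    print(f"{label}: {diff:.3f} +/- {half:.2f}  includes zero: {covers_zero}")
```

Both half-widths exceed the observed difference, so neither interval excludes zero.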
c. In your income and education regression,
(i) Explain what coefficients are significant and why? (2)
(ii) What income would you predict for someone with 3 years of education? (1)
(iii) Make a confidence interval for the income of someone with 3 years of education using some
of the information generated by Minitab below. (2)
Descriptive Statistics

Variable     N     Mean   Median   TrMean    StDev   SEMean
Educ        32   12.000   12.000   12.071    4.363    0.771

Variable      Min      Max       Q1       Q3
Educ        4.000   20.000    8.000   16.000

Column Sum of Squares
Sum of squares (uncorrected) of Educ = 5198.0
Solution: The relevant output is:
Regression Analysis

The regression equation is
Income = 5078 + 732 Educ

Predictor     Coef    Stdev   t-ratio       p
Constant      5078     1498      3.39   0.002
Educ         732.4    117.5      6.23   0.000

s = 2855     R-sq = 56.4%     R-sq(adj) = 55.0%
i) Since the p-values are both below .05, both coefficients are significant at the 5% level.
ii) The regression can be written as Income = 5078 + 732 Educ or Income = 5078 + 732.4 Educ. So
Income = 5078 + 732(3) = 7274 or Income = 5078 + 732.4(3) = 7275.2.
iii) From the outline, the confidence interval is $\mu_{Y_0} = \hat{Y}_0 \pm t\, s_{\hat{Y}}$, where
\[ s^2_{\hat{Y}} = s_e^2 \left[ \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum X^2 - n\bar{X}^2} \right] = 2855^2 \left[ \frac{1}{32} + \frac{(3 - 12)^2}{5198 - 32(12)^2} \right] = 8151025 \left[ \frac{1}{32} + \frac{81}{590} \right] = 1373758.6 \]
and $s_{\hat{Y}} = \sqrt{1373758.6} = 1172.07$.
If we use $t^{n-2}_{\alpha/2} = t^{30}_{.025} = 2.042$, we get $\mu_{Y_0} = 7274 \pm 2.042(1172.07) = 7274 \pm 2393$.
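As a check on the arithmetic in part iii), here is a short sketch that rebuilds the interval from the Minitab summary figures. The variable names are assumptions made for illustration, and the t value is the table value used in the solution.

```python
import math

# Summary figures from the Minitab output quoted above.
n = 32
s_e = 2855.0               # standard error of the regression
x_bar = 12.0               # mean of Educ
sum_x_sq = 5198.0          # uncorrected sum of squares of Educ
b0, b1 = 5078.0, 732.0     # rounded coefficients from the regression equation
x0 = 3.0                   # years of education of interest
t_025_30 = 2.042           # t, 30 df, upper .025 point (table value, assumed)

y_hat = b0 + b1 * x0
ss_x = sum_x_sq - n * x_bar ** 2            # corrected sum of squares of Educ
s_y_hat = s_e * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ss_x)
half_width = t_025_30 * s_y_hat

print(f"predicted income: {y_hat:.0f}")
print(f"interval: {y_hat:.0f} +/- {half_width:.0f}")
```

The interval is wide because 3 years of education lies far from the sample mean of 12, which inflates the second term under the square root.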
Please note the following from the 252 home page:
The rule on p-value:
If the p-value is less than the significance level (alpha) reject the null hypothesis; if the p-value is greater than or equal to the significance level, do not reject the null hypothesis.
Significance
This is a topic that was covered under hypothesis tests. Probably the first reference I made to this
was even earlier when I said that a parameter is significant if it is not zero. I later said that a null
hypothesis often says that a parameter or a difference between parameters is insignificant. If a
result is significant we reject the null hypothesis.
To put this more generally, a result is (statistically) significant if it is larger or smaller than would
be expected by chance alone. Thus in the case of a regression coefficient the measure of
significance could be the p-value, which tells us the probability of getting our actual result or
something more extreme if we assume that the population value of the coefficient is zero. If the p-value
is small (below our significance level), then it is unlikely that our assumption about the
coefficient is correct and we say that the coefficient is significant (or significantly different from
zero). Of course, the various hypothesis tests that we have discussed here are also often ways of
proving significance.
II. Do at least 4 of the following 5 problems (at least 10 points each), or do sections adding to at least 40 points. Anything extra you do helps, and grades wrap around. Show your work! State $H_0$ and $H_1$ where
applicable. Never say 'yes' or 'no' without a statistical test.
1. On the following pages there are printouts from two computer problems.
a. The One-way ANOVA Problem ( Albright, Winston, Zappe - abbreviated): An automobile parts producer
has instituted an employee empowerment program in five plants. Random samples of employees in
each plant are asked to rate the success of the program on a 1 to 10 scale. 10 being the highest
rating. They want to know if the program is being implemented with equal success at each plant
and are thus looking to see if there is a significant difference between mean ratings at each plant.
They are assuming that the results are distributed according to Normal distributions with similar
variances.
(i) Indicate what hypothesis was tested, what the p-value was and whether, using the p-value, you
would reject the null if (α) the significance level was 5% and (β) the significance level was 1%.
Explain why. Does this mean that the success was equal in all plants? (3)
(ii) Do a 'normal' and a Scheffe confidence interval ($\alpha = .05$) for the difference between the
means in the two plants that were the least successful. Do these intervals indicate a difference in
the success of the program between these two plants? Why? (4.5)
(iii) The printout gives 95% confidence intervals for the means for each plant. Find the numbers
for the confidence interval for 'Midwest.' Why is this interval smaller than the others? (2.5)
(iv) I would question whether ANOVA was appropriate for this problem because there is no
evidence that the underlying populations are Normally distributed. What method would I prefer for
this problem? (1)
One-way ANOVA problem
Worksheet size: 100000 cells
MTB > RETR 'C:\MINITAB\2X0232-1.MTW'.
Retrieving worksheet from file: C:\MINITAB\2X0232-1.MTW
Worksheet was saved on 4/ 9/2002
MTB > print c1-c5
Data Display
Row   south  midwest  n-east  s-west  west
  1       7        7       7       6     6
  2       1        6       5       4     6
  3       8       10       5       7     6
  4       7        3       5      10     6
  5       2        9       4       7     3
  6       9       10       3       6     4
  7       3        8       4       6     8
  8       8        4       5       7     6
  9       5        3       5       4     2
 10       7        2       3       3     4
 11       4        7       3       7     5
 12                7       3       8     6
 13                5       5       9     4
 14               10       5      10     7
 15               10               4     4
 16                6              10     3
 17                3               4     5
 18                5               6     4
 19                2                     7
 20                6                     6
 21                4                     4
 22                5
 23                2
 24                7
 25                8
 26                7
MTB > AOVOneway c1 c2 c3 c4 c5.
One-Way Analysis of Variance

Analysis of Variance
Source     DF        SS        MS        F        p
Factor      4     46.24     11.56     2.50    0.049
Error      85    393.55      4.63
Total      89    439.79
                                   Individual 95% CIs For Mean
                                   Based on Pooled StDev
Level       N      Mean     StDev  ---+---------+---------+---------+---
south      11     5.545     2.697          (----------*----------)
midwest    26     6.000     2.623                  (------*------)
n-east     14     4.429     1.158  (---------*--------)
s-west     18     6.556     2.229                     (--------*-------)
west       21     5.048     1.532         (-------*-------)
                                   ---+---------+---------+---------+---
Pooled StDev =    2.152              3.6       4.8       6.0       7.2
Solution: a) (i) All one-way ANOVAs test for equality of the means of the populations represented by the
columns, so $H_0$ is $\mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$. The p-value is 4.9%, so we reject the null hypothesis at the
5% significance level, but not at the 1% level. If we reject the null hypothesis we say that the success level
was not the same at all the plants.
(ii) The Northeast and the West plants were the least successful. From the outline, if we desire a single
interval for the difference between the means of column 1 and column 2, we use
\[ \mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm t^{(n-m)}_{\alpha/2}\, s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} , \]
where $s = \sqrt{MSW} = \sqrt{4.63} = 2.152$. This becomes
\[ \mu_3 - \mu_5 = (\bar{x}_3 - \bar{x}_5) \pm t^{85}_{.025}\, s \sqrt{\frac{1}{14} + \frac{1}{21}} = (4.429 - 5.048) \pm 1.988 \sqrt{4.63(0.11905)} = -0.619 \pm 1.988(0.742) = -0.619 \pm 1.475 . \]
If we desire intervals that will simultaneously be valid for a given confidence level for all possible intervals
between column means, use
\[ \mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm \sqrt{(m-1)\,F^{(m-1,\;n-m)}_{\alpha}}\; s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} , \]
which becomes
\[ \mu_3 - \mu_5 = (\bar{x}_3 - \bar{x}_5) \pm \sqrt{(5-1)\,F^{(4,85)}_{.05}} \sqrt{4.63\left(\frac{1}{14} + \frac{1}{21}\right)} = (4.429 - 5.048) \pm \sqrt{(5-1)(2.48)} \sqrt{4.63(0.11905)} = -0.619 \pm 2.338 . \]
Since both these intervals include zero, there is no significant difference.
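The same two intervals can be reproduced with a few lines of arithmetic. This is only a sketch under the assumption that the table values 1.988 and 2.48 quoted above are used as given; the variable names are illustrative.

```python
import math

# Figures from the one-way ANOVA printout.
msw = 4.63                     # error mean square
n_total, m = 90, 5             # observations and number of plants
n_neast, n_west = 14, 21
mean_neast, mean_west = 4.429, 5.048

# Table values quoted in the solution (assumed, not computed here).
t_025_85 = 1.988
f_05_4_85 = 2.48

df_error = n_total - m
diff = mean_neast - mean_west
std_err = math.sqrt(msw * (1 / n_neast + 1 / n_west))

single_half = t_025_85 * std_err
scheffe_half = math.sqrt((m - 1) * f_05_4_85) * std_err

print(f"error df: {df_error}")
print(f"single interval:  {diff:.3f} +/- {single_half:.3f}")
print(f"Scheffe interval: {diff:.3f} +/- {scheffe_half:.3f}")
```

The Scheffe half-width is larger because it has to hold for every pairwise comparison among the five plants at once.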


(iii) If we adapt the 'normal' formula to a single mean, we get $\mu_1 = \bar{x}_1 \pm t^{(n-m)}_{\alpha/2}\, s \sqrt{\frac{1}{n_1}}$. For 'Midwest' this is
\[ \mu_2 = 6.000 \pm 1.988 \sqrt{4.63} \sqrt{\frac{1}{26}} = 6.000 \pm 0.839 . \]
It is the smallest interval because we divide the pooled standard deviation by the square root of $n_2$,
which is the largest of all the sample sizes.
b. The Regression Problem: This relates the number of shares in thousands to the age of board members of
a corporation.
(i) Looking at significance tests and the value of R-squared, how successful is this regression?
Why? Why shouldn't this surprise you? (3)
(ii) Note that c1 contains 'shares' and that c4 contains predicted values of 'shares.' Add a regression
line to the graph. (1)
(iii) What equation relates the number of shares owned to the age of the board member? How many
shares does it say that we should expect an 83-year-old board member to own? Would you take this
seriously? Why? (2)
Regression Problem
Worksheet size: 100000 cells
MTB > RETR 'C:\MINITAB\2X0232-5.MTW'.
Retrieving worksheet from file: C:\MINITAB\2X0232-5.MTW
Worksheet was saved on 4/12/2002
MTB > echo
MTB > Execute 'C:\MINITAB\252SOLS3.MTB' 1.
Executing from file: C:\MINITAB\252SOLS3.MTB
MTB > #252sols3
MTB > print c1 c2
Data Display

Row   shares   age
  1      7.9    53
  2     66.4    60
  3     29.7    69
  4     60.5    49
  5     10.4    67
  6     28.7    68
  7     86.9    46
  8    121.1    62
  9     35.3    63
 10      2.8    55
 11     74.4    57
 12     13.1    71
 13      9.1    66
 14     19.1    70
 15     18.8    66
 16      3.1    57
 17     96.5    54
 18     47.0    64
 19     31.1    56
MTB > plot c1*c2 (plot omitted)
MTB > regress c1 on 1 c2 c3 c4
Regression Analysis

The regression equation is
shares = 153 - 1.86 age

Predictor      Coef    Stdev   t-ratio       p
Constant     152.95    64.82      2.36   0.031
age          -1.860    1.061     -1.75   0.098

s = 33.01     R-sq = 15.3%     R-sq(adj) = 10.3%

Analysis of Variance
SOURCE        DF       SS      MS      F       p
Regression     1     3348    3348   3.07   0.098
Error         17    18522    1090
Total         18    21870

Unusual Observations
Obs.    age   shares     Fit   Stdev.Fit   Residual   St.Resid
  8    62.0   121.10   37.65        7.70      83.45      2.60R

R denotes an obs. with a large st. resid.

MTB > plot c4*c2 (plot omitted)
MTB > plot c4*c2 c1*c2;
SUBC> symbol;
SUBC> type 3 1;
SUBC> color 8 9;
SUBC> overlay.
MTB > end
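(Plot omitted: C4, the predicted values of shares, against age. Vertical axis C4 from 0 to 100; horizontal axis age from 50 to 70.)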
Solution: b) (i) This is a very unsuccessful regression - surely the author could have found a better
predictor of the number of shares owned than age! $R^2$ is very small on a zero to one scale and the p-value
for the slope is above 5%. The regression seems to say that the number of shares owned declines as the
board member gets older. I see no reason why this should be true.
(ii) To add a regression line, just connect the x's.
(iii) The regression equation says shares = 153 - 1.86 age. If a board member is 83,
shares = 153 - 1.86(83) = -1.38. Of course, you can't own negative shares, and the fact that the oldest board
member is 71 might lead us to feel that we have exceeded our competence. Basically the low $R^2$ leaves us
unsure whether we should take any of its results seriously.
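The prediction in (iii) and the extrapolation warning can be made explicit in a few lines. This is a sketch only: the function name `predict_shares` is a hypothetical helper, and the coefficients are the rounded ones from the printed regression equation.

```python
# Ages of the 19 board members, from the data display above.
ages = [53, 60, 69, 49, 67, 68, 46, 62, 63, 55,
        57, 71, 66, 70, 66, 57, 54, 64, 56]

b0, b1 = 153.0, -1.86      # rounded coefficients from the printout


def predict_shares(age):
    """Predicted shares (in thousands) for a board member of the given age."""
    return b0 + b1 * age


new_age = 83
predicted = predict_shares(new_age)
# A prediction is an extrapolation when the age lies outside the observed range.
is_extrapolation = not (min(ages) <= new_age <= max(ages))

print(f"predicted shares at age {new_age}: {predicted:.2f} thousand")
print(f"observed ages run from {min(ages)} to {max(ages)}; extrapolating: {is_extrapolation}")
```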
2. A researcher believes that the data below has a Normal distribution with a mean of 80 and a standard
deviation of 5. For your convenience the values of $z = \frac{x - \mu}{\sigma} = \frac{x - 80}{5}$ are computed for you.
a. Use a chi-squared test to find out if the distribution is correct. (9)
b. Is there a better way to do this problem than chi-squared? Why? Do it. (5)
c. Assume that, instead of using population means given above, we actually checked the data and
found that $\bar{x} = 80$ and $s = 5$. How would this change what we did in a)? (1)
d. Assume that, instead of using population means given above, we actually checked the data and
found that $\bar{x} = 80$ and $s = 5$. How would this change what we did in b)? (1)
x interval    z interval      Observed Frequency
below 74      below -1.2       23
74-78         -1.2 to -0.4     53
78-82         -0.4 to 0.4      52
82-86         0.4 to 1.2       46
86-90         1.2 to 2.0       24
above 90      above 2.0         2
Total                         200
Solution: $H_0: N(80, 5)$ a) We find the cumulative distribution of $z$, $F_e$, and use it to find the frequency
$f_e$. We then find $E = f_e n$, where $n = 200$. $F_e$ is the cumulative probability. In the first column
$F_e(-0.4) = P(z \le -0.4) = .5 - P(-0.4 \le z \le 0) = .5 - .1554 = .3446$ and $F_e(1.2) = P(z \le 1.20)
= .5 + P(0 \le z \le 1.2) = .5 + .3849 = .8849$. $f_e$ is the difference between successive values of $F_e$. For
example, $P(-1.2 \le z \le -0.4) = P(z \le -0.4) - P(z \le -1.2) = .3446 - .1151 = .2295$.
x interval    z interval        O       Fe       fe        E
below 74      below -1.2       23    .1151    .1151    23.02
74-78         -1.2 to -0.4     53    .3446    .2295    45.90
78-82         -0.4 to 0.4      52    .6554    .3108    62.16
82-86         0.4 to 1.2       46    .8849    .2295    45.90
86-90         1.2 to 2.0       24    .9772    .0923    18.46
above 90      above 2.0         2   1.0000    .0228     4.56
Total                         200            1.0000   200.00
We use the O and E to do a conventional chi-squared analysis. In the right column is the short-cut
method.
Row      O        E      E - O   (E - O)^2   (E - O)^2/E      O^2/E
1       23    23.02     0.0200       0.000       0.00002    22.9800
2       53    45.90    -7.1000      50.410       1.09826    61.1983
3       52    62.16    10.1600     103.226       1.66064    43.5006
4       46    45.90    -0.1000       0.010       0.00022    46.1002
5       24    18.46    -5.5400      30.692       1.66260    31.2026
6        2     4.56     2.5600       6.554       1.43719     0.8772
Sum    200   200.00     0.0000                   5.85893   205.8589
For the Chi-Squared Method, we could have had to merge two cells, because the last E was below
5. However, the last-row term in the $(E - O)^2 / E$ column (1.43719) is not large, which indicates that there was no need
to do this. We thus have 6 - 1 = 5 degrees of freedom. The value of Chi-squared that we computed is
5.85893 or 205.8589 - 200 = 5.8589. From the Chi-squared table $\chi^{2(5)}_{.05} = 11.0705$. This is more than our
computed $\chi^2$, so do not reject $H_0$.
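The chi-squared computation can be sketched in code as follows. This is an illustration rather than the method of the key: it uses exact Normal probabilities from `math.erf` instead of four-decimal table values, so the statistic differs slightly in the second decimal place from the hand computation. The helper name `normal_cdf` is an assumption.

```python
import math


def normal_cdf(z):
    """Standard Normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


observed = [23, 53, 52, 46, 24, 2]
z_cuts = [-1.2, -0.4, 0.4, 1.2, 2.0]      # upper z limit of each class but the last
n = sum(observed)

# Cumulative probabilities Fe, class probabilities fe, expected counts E.
cumulative = [normal_cdf(z) for z in z_cuts] + [1.0]
class_probs = [cumulative[0]] + [
    cumulative[i] - cumulative[i - 1] for i in range(1, len(cumulative))
]
expected = [n * p for p in class_probs]

# Conventional formula and the short-cut formula (sum of O^2/E minus n).
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
chi_sq_shortcut = sum(o * o / e for o, e in zip(observed, expected)) - n

critical_05_5df = 11.0705                  # table value quoted in the solution
reject = chi_sq > critical_05_5df

print(f"chi-squared = {chi_sq:.3f} (short-cut {chi_sq_shortcut:.3f}), reject H0: {reject}")
```

The two formulas agree because the expected counts sum to n.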
b) For most problems where the population mean and standard deviation are given the best method is
Kolmogorov-Smirnov.
$F_e$ is copied from part a) and O is made into a cumulative distribution $F_o$ by dividing through by
$n = 200$ and adding down the column. D is the absolute difference between the two cumulative distributions.

Row     O     O/n      Fo       Fe          D
1      23   0.115   0.115   0.1151     0.0001
2      53   0.265   0.380   0.3446     0.0354
3      52   0.260   0.640   0.6554     0.0154
4      46   0.230   0.870   0.8849     0.0149
5      24   0.120   0.990   0.9772     0.0128
6       2   0.010   1.000   1.0000     0.0000

For the Kolmogorov-Smirnov Method the 5% critical value is $\frac{1.36}{\sqrt{200}} = 0.096$. The
maximum value of D, which is .0354, is less than this critical value, so do not reject $H_0$.
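A small sketch of the grouped Kolmogorov-Smirnov comparison, assuming the $F_e$ column is taken from part a) as given and the large-sample 5% critical value $1.36/\sqrt{n}$ is used; variable names are illustrative.

```python
import math

observed = [23, 53, 52, 46, 24, 2]
f_expected = [0.1151, 0.3446, 0.6554, 0.8849, 0.9772, 1.0000]  # Fe from part a)
n = sum(observed)

# Build the observed cumulative distribution Fo by adding down the column.
f_observed = []
running = 0
for count in observed:
    running += count
    f_observed.append(running / n)

d_values = [abs(fo - fe) for fo, fe in zip(f_observed, f_expected)]
d_max = max(d_values)

critical = 1.36 / math.sqrt(n)            # large-sample 5% critical value
reject = d_max > critical

print(f"max D = {d_max:.4f}, critical value = {critical:.3f}, reject H0: {reject}")
```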
c) If the sample mean and standard deviation have been computed from the data, we would lose 2 degrees
of freedom. We would go ahead exactly as before until the time came to look up chi-squared which would
now have 3 degrees of freedom.
d) We would go ahead exactly as in b) until the time came to use a table. We would find our critical value
on the Lilliefors table.
3. (Weirs) A maker of stain removers is testing the effectiveness of four different formulations of a new
product. Columns represent formulations 1-4 of the product and the 6 rows represent different stains
(Creosote, crayon, motor oil, grape juice, ink, coffee). Each formulation is rated on a 1-10 scale for its
effectiveness.
Stain            Form 1  Form 2  Form 3  Form 4     sum  count  Sum of squares
1                     1       7       2       5      15      4       79
2                     9      10       7       5      31      4      255
3                     4       6       1       4      15      4       69
4                     9       7       4       5      25      4      171
5                     6       8       4       4      22      4      132
6                     9       4       2       6      21      4      137
Sum                  38      42      20      29     129     24      843
Count                 6       6       6       6      24
Sum of Squares      296     314      90     143
a. Assume that the parent distribution is Normal and compare the mean ratings for the four formulations,
noting the fact that the data is cross-classified. Use $\alpha = .10$. (14) Note: If you wish to ignore the fact that the
data is classified by stain type, indicate this now and compare the column means assuming that the data is
four independent random samples from a Normal distribution. (10) ($\alpha = .10$)
b. Using the same significance level, assume that Formulation 1 is the current formula and use Scheffe
intervals to see which formulations have mean ratings that differ significantly from the current formulation.
(4)
c. Using a significance level of 15%, repeat the analysis in b) using Bonferroni intervals. (4)
Solution: If the parent distribution is Normal use ANOVA, if it's not Normal, use Friedman or
Kruskal-Wallis. If the samples are independent random samples use 1-way ANOVA or Kruskal
Wallis. If they are cross-classified, use Friedman or 2-way ANOVA.
a) 2-way ANOVA (Blocked by stain). 's' indicates that the null hypothesis is rejected.

Stain        Form 1   Form 2   Form 3   Form 4       sum  count    mean   Sum of squares   mean squared
               x1       x2       x3       x4        x_i.   n_i   xbar_i.        SS          xbar_i.^2
1            1.0000     7.0   2.0000    5.000     15.000     4    3.750         79           14.063
2            9.0000    10.0   7.0000    5.000     31.000     4    7.750        255           60.062
3            4.0000     6.0   1.0000    4.000     15.000     4    3.750         69           14.063
4            9.0000     7.0   4.0000    5.000     25.000     4    6.250        171           39.063
5            6.0000     8.0   4.0000    4.000     22.000     4    5.500        132           30.250
6            9.0000     4.0   2.0000    6.000     21.000     4    5.250        137           27.562
Sum         38.0000   +42.0 +20.0000  +29.000   =129.000    24    5.375        843          185.062
n_j               6      +6       +6       +6        =24
xbar_.j      6.3333     7.0   3.3333   4.8333      5.375 = grand mean
SS          296.000  +314.0 +90.0000 +143.000   =843.000
xbar_.j^2   40.1111   +49.0 +11.1111 +23.3611  =123.5833

From the above $\sum x = 129$, $n = 24$, $\sum x_{ij}^2 = 843.0$, $\sum \bar{x}_{i.}^2 = 185.062$, $\sum \bar{x}_{.j}^2 = 123.5833$ and
$\bar{x} = \frac{\sum x}{n} = \frac{129}{24} = 5.375$.
\[ SST = \sum x_{ij}^2 - n\bar{x}^2 = 843.0 - 24(5.375)^2 = 843.0 - 693.375 = 149.625 . \]
\[ SSC = \sum n_j \bar{x}_{.j}^2 - n\bar{x}^2 = 6(123.5833) - 24(5.375)^2 = 48.125 . \]
This is SSB in a one way ANOVA.
\[ SSR = \sum n_i \bar{x}_{i.}^2 - n\bar{x}^2 = 4(185.062) - 24(5.375)^2 = 46.875 \]
($SSW = SST - SSC - SSR = 54.625$)
Source                   SS     DF       MS        F    F.10                  H0
Rows (Stains)        46.875      5    9.375    2.574    F(5,15) = 2.27 s      Row means equal
Columns (Formulas)   48.125      3   16.042    4.405    F(3,15) = 2.49 s      Column means equal
Within (Error)       54.625     15    3.642
Total               149.625     23

So the formulations (column means) are significantly different.
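The sums of squares and F ratios in the two-way table can be recomputed directly from the 6 by 4 data matrix. The sketch below follows the computational formulas used in the solution; the variable names are illustrative, and the critical values are not computed, only the F ratios.

```python
# Ratings: rows are stains 1-6, columns are formulations 1-4.
ratings = [
    [1, 7, 2, 5],
    [9, 10, 7, 5],
    [4, 6, 1, 4],
    [9, 7, 4, 5],
    [6, 8, 4, 4],
    [9, 4, 2, 6],
]
r, c = len(ratings), len(ratings[0])
n = r * c

grand_mean = sum(sum(row) for row in ratings) / n
row_means = [sum(row) / c for row in ratings]
col_means = [sum(row[j] for row in ratings) / r for j in range(c)]

# Computational formulas: sum of squares minus the correction term n * xbar^2.
correction = n * grand_mean ** 2
sst = sum(x * x for row in ratings for x in row) - correction
ssr = c * sum(m * m for m in row_means) - correction   # rows (stains)
ssc = r * sum(m * m for m in col_means) - correction   # columns (formulas)
ssw = sst - ssr - ssc                                  # error with one observation per cell

msr = ssr / (r - 1)
msc = ssc / (c - 1)
msw = ssw / ((r - 1) * (c - 1))

print(f"SSR={ssr:.3f} SSC={ssc:.3f} SSW={ssw:.3f} SST={sst:.3f}")
print(f"F rows = {msr / msw:.3f}, F columns = {msc / msw:.3f}")
```

With one observation per cell there is no separate interaction term, so the residual after removing rows and columns serves as the error sum of squares.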
One way ANOVA (Not blocked by stain) ($SSW = SST - SSB = 149.625 - 48.125 = 101.5$)

Source                   SS     DF       MS        F    F.10                  H0
Columns (Formulas)   48.125      3   16.042    3.161    F(3,20) = 2.38 s      Column means equal
Within (Error)      101.500     20    5.075
Total               149.625     23

Once again, the formulations (column means) are significantly different.
b) This resembles problem F2. The formulas are given in the outline. $R = 6$ is the number of rows, $C = 4$
is the number of columns and $P = 1$ is the number of observations per cell. Note that if $P = 1$, replace
$RC(P-1)$ with $(R-1)(C-1) = (6-1)(4-1) = 15$, which is the error degrees of freedom above. The
Scheffe formula for column means is
\[ \mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm \sqrt{(C-1)\,F^{(C-1,\;RC(P-1))}_{\alpha}} \sqrt{\frac{2\,MSW}{PR}} , \]
which becomes
\[ \mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm \sqrt{(C-1)\,F^{(C-1,\;(R-1)(C-1))}_{\alpha}} \sqrt{\frac{2\,MSW}{R}} = (\bar{x}_1 - \bar{x}_2) \pm \sqrt{(4-1)\,F^{(3,15)}_{.10}} \sqrt{\frac{2(3.642)}{6}} \]
\[ = (\bar{x}_1 - \bar{x}_2) \pm \sqrt{3(2.49)(1.214)} = (\bar{x}_1 - \bar{x}_2) \pm \sqrt{9.069} = (\bar{x}_1 - \bar{x}_2) \pm 3.01 . \]
Since the formula works regardless of the column number, we get the following 3 contrasts.
\[ \mu_1 - \mu_2 = (6.333 - 7.000) \pm 3.01 = -0.67 \pm 3.01 \]
\[ \mu_1 - \mu_3 = (6.333 - 3.333) \pm 3.01 = 3.00 \pm 3.01 \]
\[ \mu_1 - \mu_4 = (6.333 - 4.833) \pm 3.01 = 1.50 \pm 3.01 \]
Since the error part of the formula (3.01) is larger than the difference between sample means in all three cases,
none of these differences is significant at the 10% level, although the difference between Formulations 1 and 3
(3.00) falls only just short.
c) The Bonferroni formula for column means is
\[ \mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm t^{RC(P-1)}_{\alpha/(2m)} \sqrt{\frac{2\,MSW}{PR}} . \]
Note that if $P = 1$, replace $RC(P-1)$ with $(R-1)(C-1) = (6-1)(4-1) = 15$, so that the formula reads
\[ \mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm t^{(R-1)(C-1)}_{\alpha/(2m)} \sqrt{\frac{2\,MSW}{R}} . \]
If $\alpha = .15$ and we are doing $m = 3$ intervals, $\frac{\alpha}{2m} = \frac{.15}{2(3)} = .025$. The Bonferroni formula becomes
\[ (\bar{x}_1 - \bar{x}_2) \pm t^{15}_{.025} \sqrt{\frac{2\,MSW}{R}} = (\bar{x}_1 - \bar{x}_2) \pm 2.131 \sqrt{\frac{2(3.642)}{6}} = (\bar{x}_1 - \bar{x}_2) \pm 2.131(1.102) = (\bar{x}_1 - \bar{x}_2) \pm 2.35 . \]
Once again, substitute the sample means.
\[ \mu_1 - \mu_2 = (6.333 - 7.000) \pm 2.35 = -0.67 \pm 2.35 \]
\[ \mu_1 - \mu_3 = (6.333 - 3.333) \pm 2.35 = 3.00 \pm 2.35 \]
\[ \mu_1 - \mu_4 = (6.333 - 4.833) \pm 2.35 = 1.50 \pm 2.35 \]
According to these smaller, and probably more appropriate, intervals, only Formulation 3 is
significantly worse than Formulation 1; Formulations 2 and 4 do not differ significantly from it. We will stick with the old formula.
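Because the Scheffe and Bonferroni half-widths depend on only a few numbers, they are easy to evaluate directly. The following sketch plugs the table values quoted in the solution ($F^{(3,15)}_{.10} = 2.49$ and $t^{15}_{.025} = 2.131$) into the two formulas with $C - 1 = 3$ and $R = 6$; the names are illustrative assumptions.

```python
import math

msw = 3.642                    # error mean square from the two-way table
r, c = 6, 4                    # stains (rows) and formulations (columns)
col_means = {1: 38 / 6, 2: 42 / 6, 3: 20 / 6, 4: 29 / 6}

# Table values quoted in the solution (assumed, not computed here).
f_10_3_15 = 2.49               # F(3, 15), upper .10 point
t_025_15 = 2.131               # t, 15 df, upper .025 point (alpha = .15, m = 3)

std_err = math.sqrt(2 * msw / r)
scheffe_half = math.sqrt((c - 1) * f_10_3_15) * std_err
bonferroni_half = t_025_15 * std_err

print(f"Scheffe half-width:    {scheffe_half:.2f}")
print(f"Bonferroni half-width: {bonferroni_half:.2f}")
for j in (2, 3, 4):
    diff = col_means[1] - col_means[j]
    print(
        f"Form 1 - Form {j}: {diff:5.2f}  "
        f"Scheffe significant: {abs(diff) > scheffe_half}  "
        f"Bonferroni significant: {abs(diff) > bonferroni_half}"
    )
```

Each contrast is judged significant when the absolute difference in sample means exceeds the half-width of the interval.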
3(ctd.). d. Actually, when Weirs presented the data in the previous problem, repeated below, he assumed
that the underlying distribution was not Normal. So compare the median ratings using a 10% significance
level. (6)
Stain            Form 1  Form 2  Form 3  Form 4     sum  count  Sum of squares
1                     1       7       2       5      15      4       79
2                     9      10       7       5      31      4      255
3                     4       6       1       4      15      4       69
4                     9       7       4       5      25      4      171
5                     6       8       4       4      22      4      132
6                     9       4       2       6      21      4      137
Sum                  38      42      20      29     129     24      843
Count                 6       6       6       6      24
Sum of Squares      296     314      90     143
Solution: d) This becomes a Friedman test. We rank the data within rows.
$H_0$: Columns from same distribution

          Original data              Ranks within rows
Row      1     2     3     4        1      2      3      4
1        1     7     2     5        1      4      2      3
2        9    10     7     5        3      4      2      1
3        4     6     1     4        2.5    4      1      2.5
4        9     7     4     5        4      3      1      2
5        6     8     4     4        3      4      1.5    1.5
6        9     4     2     6        4.0    2      1      3.0
Sum                                17.5   21      8.5   13.0

There are $r = 6$ rows and $c = 4$ columns. Check: The rank sums must add to $r\,\frac{c(c+1)}{2} = 6\,\frac{4(5)}{2} = 60$.
Since 17.5 + 21 + 8.5 + 13.0 = 60, we are all right. The Friedman Statistic is
\[ \chi^2_F = \frac{12}{rc(c+1)} \sum SR^2 - 3r(c+1) = \frac{12}{6(4)(5)} \left( 17.5^2 + 21^2 + 8.5^2 + 13^2 \right) - 3(6)(5) = \frac{1}{10}(988.5) - 90 = 8.85 . \]
The Friedman Table has no values for $c = 4$ and $r = 6$, so we use a chi-squared table with $c - 1 = 3$
degrees of freedom. Since $\alpha = .10$, the table gives us a critical value of 6.2514. Since our computed
chi-squared is larger than the table value, we reject $H_0$. Note - if you were told to use a significance level of
.01, you would have gotten a critical value of 11.3449 and would not have rejected the null hypothesis.
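The ranking step and the Friedman statistic can be sketched as below. The helper `rank_with_ties` is a hypothetical function written for this illustration; it assigns tied values the average of the ranks they span, which is the convention used in the rank table above.

```python
def rank_with_ties(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


ratings = [
    [1, 7, 2, 5],
    [9, 10, 7, 5],
    [4, 6, 1, 4],
    [9, 7, 4, 5],
    [6, 8, 4, 4],
    [9, 4, 2, 6],
]
r, c = len(ratings), len(ratings[0])

ranks = [rank_with_ties(row) for row in ratings]
rank_sums = [sum(row[j] for row in ranks) for j in range(c)]

friedman = 12 / (r * c * (c + 1)) * sum(s * s for s in rank_sums) - 3 * r * (c + 1)

print(f"rank sums: {rank_sums}")
print(f"Friedman statistic: {friedman:.2f}")
```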
 


4. Use methods appropriate to testing goodness of fit.
a. Test the hypothesis that the numbers below came from a Normal distribution. Use a 10%
significance level. (6) Note that Minitab says the following:
mean     303.000
stdev    64.0878
n        9.00000
b. Test the hypothesis that the numbers below came from a Normal distribution with a mean of
240 and a standard deviation of 50. (6)
238 222 272 280 292 301 333 357 432
Solution: a) $H_0: N(?, ?)$   $H_1$: Not Normal
Because the mean and standard deviation are unknown, this is a Lilliefors problem. Note that data must be
in order for the Lilliefors or K-S method to work. From the data we found that $\bar{x} = 303.00$ and
$s = 64.0878$. $t = \frac{x - \bar{x}}{s}$. $F(t)$ actually is computed from the Normal table. For example
$F_e(222) = P(x \le 222) = P(z \le -1.26) = P(z \le 0) - P(-1.26 \le z \le 0) = .5 - .3962 = .1038$. D is the
difference (absolute value) between the two cumulative distributions.

Row     x       t       Fe    O        O/n        FO          D
1     222   -1.26   0.1038    1   0.111111   0.11111   0.007311
2     238   -1.01   0.1562    1   0.111111   0.22222   0.066022
3     272   -0.48   0.3156    1   0.111111   0.33333   0.017733
4     280   -0.36   0.3594    1   0.111111   0.44444   0.085044
5     292   -0.17   0.4325    1   0.111111   0.55556   0.123056
6     301   -0.03   0.4880    1   0.111111   0.66667   0.178667
7     333    0.47   0.6808    1   0.111111   0.77778   0.096978
8     357    0.84   0.7995    1   0.111111   0.88889   0.089389
9     432    2.01   0.9778    1   0.111111   1.00000   0.022200

The maximum deviation is 0.17867. The Lilliefors table for $\alpha = .10$ and $n = 9$ gives a critical value of
0.249. Since our maximum deviation does not exceed the critical value, we do not reject $H_0$.
b) $H_0: N(240, 50)$   $H_1$: Not $N(240, 50)$
Because the population mean and standard deviation are known, this is a Kolmogorov-Smirnov problem.
$z = \frac{x - \mu}{\sigma}$.

Row     x       z       Fe    O        O/n        FO          D
1     222   -0.36   0.3594    1   0.111111   0.11111   0.248289
2     238   -0.04   0.4840    1   0.111111   0.22222   0.261778
3     272    0.64   0.7389    1   0.111111   0.33333   0.405567
4     280    0.80   0.7881    1   0.111111   0.44444   0.343656
5     292    1.04   0.8508    1   0.111111   0.55556   0.295244
6     301    1.22   0.8888    1   0.111111   0.66667   0.222133
7     333    1.86   0.9686    1   0.111111   0.77778   0.190822
8     357    2.34   0.9925    1   0.111111   0.88889   0.103611
9     432    3.84   0.9999    1   0.111111   1.00000   0.000100

The maximum deviation is 0.405567. The Kolmogorov-Smirnov table for $\alpha = .10$ and $n = 9$ gives a critical
value of 0.387. Since our maximum deviation exceeds the critical value, reject $H_0$.
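Both maximum deviations can be computed with one small function. The sketch below is an illustration, not the key's procedure: it uses exact Normal probabilities from `math.erf` and unrounded z values, so the Lilliefors deviation differs from the hand-computed 0.17867 in the third decimal place. As in the tables above, the deviation is measured only at the top of each step of the observed cumulative distribution. The names `normal_cdf` and `max_deviation` are assumptions.

```python
import math
import statistics


def normal_cdf(z):
    """Standard Normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def max_deviation(data, mean, sd):
    """Largest |Fo - Fe| over the sorted data, with Fo = i/n at the i-th value."""
    ordered = sorted(data)          # the method requires the data in order
    n = len(ordered)
    return max(
        abs((i + 1) / n - normal_cdf((x - mean) / sd))
        for i, x in enumerate(ordered)
    )


data = [238, 222, 272, 280, 292, 301, 333, 357, 432]

# Part a): parameters estimated from the sample, so compare with the Lilliefors table.
d_lilliefors = max_deviation(data, statistics.mean(data), statistics.stdev(data))
# Part b): parameters given in the hypothesis, so compare with the K-S table.
d_ks = max_deviation(data, 240, 50)

print(f"a) max D = {d_lilliefors:.4f}, Lilliefors critical value 0.249")
print(f"b) max D = {d_ks:.4f}, K-S critical value 0.387")
```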
5. (Weirs) The following data gives years of membership and numbers of shares (in thousands) owned for 8
board members of our corporation. Numbers are the dependent variable and years is the independent
variable.
Data Display

Row      share   years   years squared   shares squared
1          300       6              36            90000
2          408      12             144           166464
3          560      14             196           313600
4          252       6              36            63504
5          288       9              81            82944
6          650      13             169           422500
7          600      15             225           360000
8          522       9              81           272484
Total     3580      84             968          1771496

Note that $n = 8$ and that you will have to compute $\sum xy$.
a. Compute the regression equation $Y = b_0 + b_1 x$ to predict thousands of shares owned on the basis
of years of membership. (6)
b. On the basis of your regression, how many thousands of shares do you expect to be owned by
someone who has been on the board for 3 years? (1)
c. Compute $R^2$. (4)
d. Compute $s_e$. (3)
e. Compute $s_{b_0}$ and do a significance test on $b_0$. (4)
f. Do an interval that shows the average number of shares that would be owned by someone who
has been on the board for 3 years. (3)
g. Using your SST etc., put together the ANOVA table. (6)
Solution: $\sum x = 84$, $\sum y = 3580$, $\sum x^2 = 968$ and $\sum y^2 = 1771496$. After all this time, trying to get
$\sum x^2$ by squaring $\sum x$ or to get $\sum xy$ by multiplying $\sum x$ by $\sum y$ is inexcusable.
We compute $\sum xy = 40788$ (see the Appendix).
Spare Parts Computation:
\[ \bar{x} = \frac{\sum x}{n} = \frac{84}{8} = 10.5 \qquad \bar{y} = \frac{\sum y}{n} = \frac{3580}{8} = 447.5 \]
\[ SS_x = \sum x^2 - n\bar{x}^2 = 968 - 8(10.5)^2 = 86.0 \]
\[ S_{xy} = \sum xy - n\bar{x}\bar{y} = 40788 - 8(10.5)(447.5) = 3198.0 \]
\[ SS_y = \sum y^2 - n\bar{y}^2 = 1771496 - 8(447.5)^2 = 169446.0 = SST \]
a)
\[ b_1 = \frac{S_{xy}}{SS_x} = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2} = \frac{3198.0}{86.0} = 37.1860 \]
\[ b_0 = \bar{y} - b_1\bar{x} = 447.5 - 37.1860(10.5) = 57.047 \]
b) $Y = b_0 + b_1 x$ becomes $\hat{Y} = 57.047 + 37.1860x$. So if $x_0 = 3$, then $\hat{Y}_0 = 57.047 + 37.1860x_0
= 57.047 + 37.1860(3) = 168.605$ is the number of shares (in thousands) that we forecast for someone who has been on the
board for 3 years.
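The spare-parts computation translates almost line for line into code. This is a minimal sketch with illustrative variable names; it works from the raw data rather than the rounded intermediate values, so the last digit can differ slightly from the hand computation.

```python
shares = [300, 408, 560, 252, 288, 650, 600, 522]   # y, in thousands
years = [6, 12, 14, 6, 9, 13, 15, 9]                # x
n = len(years)

sum_x, sum_y = sum(years), sum(shares)
sum_xx = sum(x * x for x in years)
sum_xy = sum(x * y for x, y in zip(years, shares))

x_bar, y_bar = sum_x / n, sum_y / n
ss_x = sum_xx - n * x_bar ** 2          # corrected sum of squares of x
s_xy = sum_xy - n * x_bar * y_bar       # corrected sum of cross products

b1 = s_xy / ss_x
b0 = y_bar - b1 * x_bar
forecast_3_years = b0 + b1 * 3

print(f"sum xy = {sum_xy}, SSx = {ss_x}, Sxy = {s_xy}")
print(f"b1 = {b1:.4f}, b0 = {b0:.3f}, forecast at 3 years = {forecast_3_years:.3f}")
```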
c) $SSR = b_1 S_{xy} = b_1\left(\sum xy - n\bar{x}\bar{y}\right) = 37.1860(3198.0) = 118920.8$, so
\[ R^2 = \frac{SSR}{SST} = \frac{118920.8}{169446.0} = .7018 \]
or
\[ R^2 = \frac{S_{xy}^2}{SS_x\,SS_y} = \frac{\left(\sum xy - n\bar{x}\bar{y}\right)^2}{\left(\sum x^2 - n\bar{x}^2\right)\left(\sum y^2 - n\bar{y}^2\right)} = \frac{(3198.0)^2}{86.0(169446.0)} = .7018 \]
($0 \le R^2 \le 1$ always!)
d) $SSE = SST - SSR = 169446.0 - 118920.8 = 50525.2$
\[ s_e^2 = \frac{SSE}{n-2} = \frac{50525.2}{6} = 8420.867 \]
or
\[ s_e^2 = \frac{SS_y - b_1 S_{xy}}{n-2} = \frac{\sum y^2 - n\bar{y}^2 - b_1\left(\sum xy - n\bar{x}\bar{y}\right)}{n-2} = \frac{169446.0 - 37.1860(3198.0)}{6} = 8420.862 \]
or
\[ s_e^2 = \frac{(1 - R^2)\,SST}{n-2} = \frac{(1 - R^2)\left(\sum y^2 - n\bar{y}^2\right)}{n-2} = \frac{(1 - .7018)(169446.0)}{6} = 8421.466 \]
or
\[ s_e^2 = \frac{\sum y^2 - n\bar{y}^2 - b_1^2\left(\sum x^2 - n\bar{x}^2\right)}{n-2} . \]
So $s_e = \sqrt{8420.867} = 91.7653$.
($s_e^2$ is always positive!)
e) $H_0: \beta_0 = 0$   $H_1: \beta_0 \ne 0$
\[ s_{b_0}^2 = s_e^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{SS_x} \right) = 8420.867 \left( \frac{1}{8} + \frac{10.5^2}{86.0} \right) = 11847.9 \qquad s_{b_0} = \sqrt{11847.9} = 108.8 \]
\[ t = \frac{b_0 - \beta_{00}}{s_{b_0}} = \frac{b_0 - 0}{s_{b_0}} = \frac{57.047}{108.8} = 0.524 \]
Assume that $\alpha = .05$ and make a diagram. Show an almost
normal curve and that the 'reject' region is below $-t^{n-2}_{\alpha/2} = -t^{6}_{.025} = -2.447$ or above $t^{n-2}_{\alpha/2} = t^{6}_{.025} = 2.447$.
Since 0.524 is between these values, do not reject $H_0$. Conclude that $\beta_0$ is insignificant.
f) We found in b) that if $x_0 = 3$, $\hat{Y}_0 = 168.605$.
\[ s_{\hat{y}_0}^2 = s_e^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum x^2 - n\bar{x}^2} \right) = s_e^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_x} \right) = 8420.867 \left( \frac{1}{8} + \frac{(3 - 10.5)^2}{86.0} \right) = 6560.44 \]
\[ s_{\hat{y}_0} = \sqrt{6560.44} = 80.9966 \]
So $\mu_{Y_0} = \hat{Y}_0 \pm t_{\alpha/2}\, s_{\hat{y}_0} = 168.605 \pm 2.447(80.9966) = 168 \pm 198$.
g) From above, $SSR = 118920.8$, $SST = 169446.0$ and $SSE = 50525.2$. $H_0$ is that
there is no relation between Y and X.

Source                  SS    DF          MS         F    F.05
Regression        118920.8     1    118920.8    14.122    F(1,6) = 5.99 s
Error (Within)     50525.2     6     8420.87
Total             169446.0     7

Since the computed F (14.122) is larger than the table F (5.99), reject $H_0$. There is a significant relation
between Y and X, which agrees with the p-value of 0.009 on the computer printout for this exercise.
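Parts c) through g) all follow from the three spare parts $SS_x$, $S_{xy}$ and $SS_y$. The sketch below strings them together; the names are illustrative, the t value 2.447 and F value 5.99 are the table values quoted in the solution, and unrounded intermediate values are carried, so the results can differ from the hand computation in the last digit.

```python
import math

n = 8
x_bar, y_bar = 10.5, 447.5
ss_x, s_xy, ss_y = 86.0, 3198.0, 169446.0   # spare parts from the solution
t_025_6 = 2.447                              # t, 6 df, upper .025 point (table value)
f_05_1_6 = 5.99                              # F(1, 6), upper .05 point (table value)

b1 = s_xy / ss_x
b0 = y_bar - b1 * x_bar

ssr = b1 * s_xy                  # regression sum of squares
sse = ss_y - ssr                 # error sum of squares
r_squared = ssr / ss_y
s_e_sq = sse / (n - 2)           # error mean square
s_e = math.sqrt(s_e_sq)

# Significance test on the intercept.
s_b0 = math.sqrt(s_e_sq * (1 / n + x_bar ** 2 / ss_x))
t_b0 = b0 / s_b0

# Confidence interval for the mean of Y at x0 = 3.
x0 = 3
y_hat0 = b0 + b1 * x0
s_y_hat0 = math.sqrt(s_e_sq * (1 / n + (x0 - x_bar) ** 2 / ss_x))
half_width = t_025_6 * s_y_hat0

# ANOVA F ratio: regression mean square (1 df) over error mean square.
f_ratio = ssr / s_e_sq

print(f"R^2 = {r_squared:.4f}, s_e = {s_e:.2f}")
print(f"s_b0 = {s_b0:.1f}, t = {t_b0:.3f}, reject H0 for b0: {abs(t_b0) > t_025_6}")
print(f"interval at x0 = 3: {y_hat0:.1f} +/- {half_width:.1f}")
print(f"F = {f_ratio:.2f}, reject H0 of no relation: {f_ratio > f_05_1_6}")
```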
Appendix: Computation of column sums.

Row     share   years     C3      C4        C5
i           y       x    x^2      xy       y^2
1         300       6     36    1800     90000
2         408      12    144    4896    166464
3         560      14    196    7840    313600
4         252       6     36    1512     63504
5         288       9     81    2592     82944
6         650      13    169    8450    422500
7         600      15    225    9000    360000
8         522       9     81    4698    272484
Sum      3580      84    968   40788   1771496
It's worthwhile looking at the computer output for this exercise.
MTB > RETR 'C:\MINITAB\2X0232-4.MTW'. (Retrieves previously stored data)
Retrieving worksheet from file: C:\MINITAB\2X0232-4.MTW
Worksheet was saved on 4/12/2002
MTB > Execute 'C:\MINITAB\252SOLS.MTB' 1. (Executes previously stored commands)
Executing from file: C:\MINITAB\252SOLS.MTB

Regression Analysis

The regression equation is
shares = 57 + 37.2 years

Predictor      Coef    Stdev   t-ratio       p
Constant       57.0    108.8      0.52   0.619
years        37.186    9.895      3.76   0.009

s = 91.77     R-sq = 70.2%     R-sq(adj) = 65.2%

Analysis of Variance
SOURCE        DF        SS        MS       F       p
Regression     1    118921    118921   14.12   0.009
Error          6     50525      8421
Total          7    169446