252y0431 4/30/04
ECO252 QBA2
THIRD HOUR EXAM
Apr 16 2004
Name
Key
Hour of Class Registered ____
I. (30+ points) Do all the following (2 points each unless noted otherwise). Do not answer question ‘yes’
or ‘no’ without giving reasons.
1. Turn in your computer problems 2 and 3 marked as requested in the Take-home. (5 points, 2 point
penalty for not doing.)
2. (Dummeldinger) As part of a study to investigate the effect of helmet design on football injuries,
head width measurements were taken for 30 subjects randomly selected from each of 3 groups (High
school football players, college football players and college students who do not play football – so that
there are a total of 90 observations) with the object of comparing the typical head widths of the three
groups. If the researchers are reluctant to assume that the data in each of these three groups comes from
a Normally distributed population, they should use the following method.
a. *The Kruskal-Wallis test.
b. One-way ANOVA
c. The Friedman test
d. Two-Way ANOVA
3. Assume that the researchers ignore your advice, whether right or wrong, in problem 2. If one-way
ANOVA is used, how many degrees of freedom apply to the Within sum of squares? [9]
Solution: Because n = 90, there are a total of n − 1 = 89 degrees of freedom. The three columns use up m − 1 = 2 degrees of freedom, leaving n − m = 87 for the Within sum of squares.
4. (Berenson et al.) In a study of drive-through times at fast food chains, the following was recorded (in
seconds).
n₁ = n₂ = n₃ = n₄ = n₅ = 20, x̄.1 = 150, x̄.2 = 167, x̄.3 = 169, x̄.4 = 171, x̄.5 = 172, where 1 =
Wendy’s, 2 = McDonald’s, 3 = Checkers, 4 = Burger King, 5 = Long John Silver’s.
H₀: μ₁ = μ₂ = μ₃ = μ₄ = μ₅
One-Way ANOVA

Source        DF       SS        MS      F Statistic    p Value
Between        4      ????      ????       ????         3.24067E-08
Within        95     12407     130.6
Total         99      ????
You do not need to fill in any of the omitted data. Does the ANOVA show a significant difference
between drive-through times? Why? (2) [11]
Solution: Because the p-value is 3.24067E-08 (about .0000000324), which is less than any value of α that we might use, reject the null hypothesis of equal mean drive-through times for the five chains.
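This is more than the question asks, but the omitted F statistic can actually be recovered from the reported p-value. A minimal Python sketch, assuming scipy is available (nothing here is needed for the answer):

from scipy.stats import f

p_value = 3.24067e-08
df_between, df_within = 4, 95
F = f.isf(p_value, df_between, df_within)   # invert the p-value to get the F statistic
print(f"F = {F:.1f}")                       # this is the '????' entry in the F column
# Any reasonable alpha is far above this p-value, so the mean times differ significantly.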
5. From the above ANOVA and the means given above, do the mean times for McDonald’s and
Long John Silver’s differ significantly? Use a Tukey method. (5)
All right you Tukeys, here’s the answer. The formula given in the last graded assignment was
μ₁ − μ₂ = (x̄₁ − x̄₂) ± q_α(m, n−m) s √[½(1/n₁ + 1/n₂)].
This gives rise to Tukey’s HSD (Honestly Significant Difference) procedure. Two sample means x̄.1 and x̄.2 are significantly different if |x̄.1 − x̄.2| is greater than q_α(m, n−m) s √[½(1/n₁ + 1/n₂)]. We have for McDonald’s x̄.2 = 167 and for Long John Silver’s x̄.5 = 172. We can read n₁ = n₂ = n₃ = n₄ = n₅ = 20, s = √MSW = √130.6 = 11.428 and n − m = 95 from the printout. From the Tukey table, q.05(m, n−m) = q.05(5, 95) = 3.93. The confidence interval is thus
μ₂ − μ₅ = (x̄₂ − x̄₅) ± q.05(5, 95) s √[½(1/n₂ + 1/n₅)] = (167 − 172) ± 3.93(11.428)√[½(1/20 + 1/20)]
= −5 ± 44.912(0.2236) = −5 ± 10.04.
Since 10.04 is larger in absolute value than −5, this confidence interval includes zero and the difference is not significant.
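For anyone who wants a computer check, here is a minimal Python sketch of the same Tukey comparison. It assumes scipy is available and simply reuses the summary numbers read from the printout above; the studentized-range quantile stands in for the printed Tukey table.

from math import sqrt
from scipy.stats import studentized_range

m, n_total = 5, 100                        # 5 chains, 20 cars each
df_within = n_total - m                    # 95
s = sqrt(130.6)                            # s = sqrt(MSW) = 11.428
xbar2, xbar5, n2, n5 = 167, 172, 20, 20    # McDonald's and Long John Silver's

q = studentized_range.ppf(0.95, m, df_within)      # q.05(5, 95), about 3.93
halfwidth = q * s * sqrt(0.5 * (1 / n2 + 1 / n5))  # Tukey error term, about 10.0
diff = xbar2 - xbar5
print(f"difference = {diff}, error term = {halfwidth:.2f}")
# The interval (-5 +/- 10) includes zero, so the two mean times do not differ significantly.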
Exhibit 1: A large national bank charges local companies for using their services. A bank official reported the
results of a regression analysis designed to predict the bank’s charges (Y) -- measured in dollars per month -- for
services rendered to local companies. One independent variable used to predict service charge to a company is the
company’s sales revenue (X) -- measured in millions of dollars. Data for 21 companies who use the bank’s
services were used to fit the model. The analyst took the Minitab output home to check out, but it fell into a
puddle and all that he (or I) can read is below.
The regression equation is
Y = -2700 + 20.00 X

Predictor       Coef      Stdev    t-ratio        p
Constant     -2700.0    -------      -----    0.600
X              20.000   -------      -----    0.034

s = 65.00    R-sq = ----    R-sq(adj) = ----
6. Referring to Exhibit 1, interpret the estimate of β₀, the Y-intercept of the line.
a) All companies will be charged at least $2,700 by the bank.
b) *There is no practical interpretation since a sales revenue of $0 is a nonsensical value.
c) About 95% of the observed service charges fall within $2,700 of the least squares line.
d) For every $1 million increase in sales revenue, we expect a service charge to decrease
$2,700.
7. Referring to Exhibit 1, interpret the p-value for testing whether β₁ exceeds 0.
a) *There is sufficient evidence (at the α = 0.05 level) to conclude that sales revenue (X) is
a useful linear predictor of service charge (Y).
b) There is insufficient evidence (at the α = 0.10 level) to conclude that sales revenue (X) is
a useful linear predictor of service charge (Y).
c) Sales revenue (X) is a poor predictor of service charge (Y).
d) For every $1 million increase in sales revenue, we expect a service charge to increase
$0.034.
8. Referring to Exhibit 1, a 95% confidence interval for β₁ is (15, 30). Interpret the interval.
a) We are 95% confident that the mean service charge will fall between $15 and $30 per
month.
b) We are 95% confident that the sales revenue (X) will increase between $15 and $30
million for every $1 increase in service charge (Y).
c) *We are 95% confident that average service charge (Y) will increase between $15 and
$30 for every $1 million increase in sales revenue (X).
d) At the α = 0.05 level, there is no evidence of a linear relationship between service
charge (Y) and sales revenue (X). [22]
Exhibit 2: The marketing manager of a company producing a new cereal aimed at children wants to examine the effect of the color and shape of the box's logo on the approval rating of the cereal. He combined 4 colors and 3 shapes to produce a total of 12 designs. Each logo was presented to 2 different groups (a total of 24 groups) and the approval rating for each was recorded and is shown below. The manager analyzed these data using the α = 0.05 level of significance for all inferences.
                                 COLORS
SHAPES        Red        Green       Blue        Yellow
Circle        54, 44     67, 61      36, 44      45, 41
Square        34, 36     56, 58      36, 30      21, 25
Diamond       46, 48     60, 60      34, 38      31, 33
Analysis of Variance

Source          df       SS         MS        F        p
Colors           3     2711.17    903.72    72.30    0.000
Shapes           2      579.00    289.50    23.16    0.000
Interaction      6      150.33     25.055    2.044
Error           12      150.00     12.500
Total           23     3590.50
9. Referring to Exhibit 2, fill in the first 5 missing numbers (not the missing p-value). (3)
Answer: The missing values are filled in above (they were shown in red in the original). Note that 6 is the product of 3 and 2 and that 12 makes the df column add up to 23. The MS column can be found by dividing SS by df. The F column is the MS values divided by MSW = 12.5, so MSW can be obtained or checked from the two values of F that were supplied.
10. Referring to Exhibit 2, assume that your degrees of freedom are correct and find the 5% value of F
on the table that would be used to test if the interaction is significant. What is your conclusion and
why? (3) [28]
6,12  3.00 . Since the computed F is below the table F, we do not reject the null
Answer: F.05
hypothesis and conclude that the interaction is not significant.
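As a cross-check (not required on the exam), the table value for the interaction test can be reproduced in Python; this is a minimal sketch assuming scipy is installed.

from scipy.stats import f

ms_interaction = 150.33 / 6          # 25.055
ms_error = 150.00 / 12               # 12.500
f_computed = ms_interaction / ms_error          # about 2.0
f_table = f.ppf(0.95, 6, 12)                    # F.05(6,12) = 3.00
print(f"computed F = {f_computed:.3f}, table F = {f_table:.2f}")
# The computed F is below the table value, so do not reject H0: no interaction.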
Exhibit 3 (Mendenhall, et al.): The president of a local company has asked the vice presidents of the firm to
provide an analysis of the business climate of 4 states that may be considered for the location of a manufacturing
facility. Each VP rates the state’s business climate on a 1-10 scale with 10 as outstanding and 1 as unacceptable.
                                    State
Vice President    Arkansas    Colorado    Illinois    Iowa
Abel                 8.5         8.0         3.5       6.0
Baker                7.5         8.0         6.0       5.5
Charley              9.0         6.0         4.0       7.0
Dogg                 8.0         6.0         7.0       4.0
Easy                 7.0         5.5         4.5       7.5
11. Referring to Exhibit 3, assume that the underlying distribution is not Normal. Do an appropriate analysis. a) Tell what test you are going to use. (1) b) State your null hypothesis. (1) c) Perform the test and state your conclusion. (4) d) On the basis of your results, should business climate be considered in locating the facility? Why? (1) [35]
Solution: a) We use the Friedman test. The following is edited from the outline.
The Friedman test is equivalent to two-way ANOVA with one observation per cell when the underlying distribution is non-normal. The null hypothesis is H₀: the columns come from the same distribution, or the medians are equal. Note that the only difference between this and the Kruskal-Wallis test is that the data are cross-classified in the Friedman test. There are c columns and r rows. In each row the numbers are ranked from 1 to c. For each column, compute SRᵢ, the rank sum of column i. To check the ranking, note that the sum of the rank sums is Σᵢ SRᵢ = rc(c + 1)/2. Now compute the Friedman statistic
χ²_F = [12/(rc(c + 1))] Σᵢ SRᵢ² − 3r(c + 1).
If the size of the problem is larger than those shown in the Friedman table, use the χ² distribution with df = c − 1, where c is the number of columns. If χ²_F is not larger than χ².05, do not reject the null hypothesis.
b) The null hypothesis is equal medians.
c) The table is shown below with names replaced by numbers and rankings shown.
            x1    r1     x2    r2     x3    r3     x4    r4
Row 1      8.5     4    8.0     3    3.5     1    6.0     2
Row 2      7.5     3    8.0     4    6.0     2    5.5     1
Row 3      9.0     4    6.0     2    4.0     1    7.0     3
Row 4      8.0     4    6.0     2    7.0     3    4.0     1
Row 5      7.0     3    5.5     2    4.5     1    7.5     4
SR                18          13           8          11
The sum of the rank sums is 18 + 13 + 8 + 11 = 50, and this checks against rc(c + 1)/2 = 5(4)(5)/2 = 50. The Friedman statistic is
χ²_F = [12/(rc(c + 1))] Σᵢ SRᵢ² − 3r(c + 1) = [12/(5(4)(5))](18² + 13² + 8² + 11²) − 3(5)(5)
= 0.12(324 + 169 + 64 + 121) − 75 = 81.36 − 75 = 6.36.
The dimensions are too large for the Friedman table, so compare this with χ².05(3) = 7.8147. Since the computed chi-squared is smaller than the table value, do not reject the null hypothesis.
d) So, since there is no significant difference in business climate, do not consider it.
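If you want to check the hand computation, scipy’s Friedman routine gives the same statistic. This is only a sketch (it assumes scipy is installed); each list holds one state’s ratings in the same VP order.

from scipy.stats import friedmanchisquare

arkansas = [8.5, 7.5, 9.0, 8.0, 7.0]
colorado = [8.0, 8.0, 6.0, 6.0, 5.5]
illinois = [3.5, 6.0, 4.0, 7.0, 4.5]
iowa     = [6.0, 5.5, 7.0, 4.0, 7.5]

stat, p = friedmanchisquare(arkansas, colorado, illinois, iowa)
print(f"Friedman chi-squared = {stat:.2f}, p-value = {p:.3f}")
# The statistic should come out near 6.36 with a p-value above .05,
# so we do not reject the hypothesis of equal medians.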
ECO252 QBA2
Third EXAM
Apr 16, 2004
TAKE HOME SECTION
Name: _________________________
Student Number: _________________________
Class days and time : _________________________
Please Note: computer problems 2 and 3 should be turned in with the exam. In problem 2, the 2 way
ANOVA table should be completed. The three F tests should be done with a 5% significance level and you
should note whether there was (i) a significant difference between drivers, (ii) a significant difference
between cars and (iii) significant interaction. In problem 3, you should show on your third graph where the
regression line is.
II. Do the following: (23+ points). Assume a 5% significance level. Show your work!
1. (Albright, Winston, Zappe) Boa Constructors, an international construction company with offices in Texas, the Cayman Islands,
Belarus, Bosnia and Iraq, conducts an employee empowerment program and after a few months asks random samples of its
employees in each office to rate the program on a 1-10 scale. Assume that each column below represents a random sample taken in one office. Assume that the underlying distribution is Normal and test the hypothesis H₀: μ₁ = μ₂ = μ₃ = μ₄ = μ₅. Data is on the next page.
a) Note that office 2, the ‘head’ office in the Cayman Islands, has a smaller sample than the rest. You can help by adding a seventh
measurement, the third to last digit of your student number (If it’s a zero, use 10). For example, Seymour Butz’s student number is
976500 and he will have a second column that reads 7, 6, 10, 3, 9, 10, 5. This should not change the results by much. Find the
sample variance of this column. (2)
b) Test the hypothesis (6) Show your work – it is legitimate to check your results by running these problems on the computer, but I
expect to see hand computations for every part of them.
c) Compare means two by two, using any one appropriate statistical method, to find out which were happiest. Actually, we really
want to test if the programs worked significantly better in the first two offices, which are English-speaking, than the other three,
which are not. Citing numbers from your comparison results, is this correct? (3)
d) (Extra Credit) Now we find out that this was not a random sample and that each row represents a separate job description. If this
changes your analysis, redo the analysis. In order to fill out the data from the Cayman Islands, use the last two digits in your student
number. For example, Seymour Butz’s student number is 976500 and he will have a second column that reads 7, 6, 10, 3, 9, 10, 5,
10, 10 (5)
e) (Extra Credit) What if you found out that each column in the data in b) was a random sample from a non-Normal distribution? If
this changes your analysis, redo the analysis. (5)
f) Run Levene’s test on the data in b). You may do this by computer. There will be lots of output, but just look at the 2 or 3 lines from Levene’s test. What does it test for and what is the conclusion? (2)
Hint: If you put your data in the first 5 columns of Minitab with a column number above them, the following should be of interest.
MTB > AOVOneway c1-c5.        #Does a 1-way ANOVA
MTB > Stack c1-c5 c11;        #Stacks the data in c11, col. no. in c12.
SUBC>   Subscripts c12;
SUBC>   UseNames.
MTB > rank c11 c13            #Puts the ranks of the stacked data in c13
MTB > Unstack (c13);          #Unstacks the data in the next 5 available
SUBC>   Subscripts c12;       # columns. Uses IDs in c12.
SUBC>   After;
SUBC>   VarNames.
MTB > %Vartest c11 c12        #Does a bunch of tests, including Levene's,
                              # on stacked data in c11 with IDs in c12.
If you remember what you did in Computer Problem 2, you should be able to add row numbers
in an unused column and run a 2-way ANOVA.
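If you would rather work in Python than Minitab, a rough equivalent of the stack-and-rank hints is sketched below. The file name and column layout are assumptions; any worksheet with one column of ratings per office will do.

import pandas as pd
from scipy.stats import f_oneway

df = pd.read_csv("offices.csv")                  # assumed: one column of ratings per office
stacked = df.melt(var_name="office", value_name="rating").dropna()   # like MTB Stack
stacked["rank"] = stacked["rating"].rank()                            # like MTB rank
groups = [g["rating"].values for _, g in stacked.groupby("office")]
print(f_oneway(*groups))                         # like MTB AOVOneway c1-c5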
                  Ratings of Program
                        Office
Row        1       2       3       4       5
 1         8       7       7       5       6
 2         2       6       5       3       6
 3         9      10       5       6       6
 4         8       3       5       9       6
 5         3       9       4       6       3
 6        10      10       3       5       4
 7         9               5       5       8
 8         6               5       6       6
 9         8               3       3       2

Sum of column 1 = 63.000    Sum of squares of column 1 = 503.00
Sum of column 3 = 42.000    Sum of squares of column 3 = 208.00
Sum of column 4 = 48.000    Sum of squares of column 4 = 282.00
Sum of column 5 = 47.000    Sum of squares of column 5 = 273.00
Solution: Seymour’s data, with column sums, counts, means and sums of squares, is below.

                  Ratings of Program
                        Office
Row        1       2       3       4       5       Total
 1         8       7       7       5       6
 2         2       6       5       3       6
 3         9      10       5       6       6
 4         8       3       5       9       6
 5         3       9       4       6       3
 6        10      10       3       5       4
 7         9       5       5       5       8
 8         6               5       6       6
 9         8               3       3       2
Sum       63      50      42      48      47        250
n_j        9       7       9       9       9         43
x̄_j        7.000   7.143   4.667   5.333   5.222     x̄ = 250/43 = 5.814
Σx²      503     400     208     282     273        1666
x̄_j²      49      51.020  21.778  28.444  27.272

a) For the second column, n = 7, Σx = 50, Σx² = 400 and x̄ = 50/7 = 7.143, so
s² = (Σx² − nx̄²)/(n − 1) = (400 − 7(7.143)²)/6 = 7.1405 and s = √7.1405 = 2.673.

b) Using the totals above, Σx = 250, n = 43, x̄ = 250/43 = 5.814 and Σx² = 1666.
SST = Σx² − nx̄² = 1666 − 43(5.814)² = 1666 − 1453.512 = 212.488
SSB = Σ n_j x̄_j² − nx̄² = 9(49) + 7(51.020) + 9(21.778) + 9(28.444) + 9(27.272) − 1453.512
    = 1495.136 − 1453.512 = 41.624
SSW = SST − SSB = 212.488 − 41.624 = 170.864
Source         SS        DF       MS        F
Between       41.624      4      10.406     2.31
Within       170.864     38       4.496
Total        212.488     42

Since the computed F (2.31) is less than F.05(4,38) = 2.62, we do not reject H₀, which is ‘no difference between mean satisfaction by offices.’ I’m really surprised. There isn’t too much reason to compare offices at this point.
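A quick computer check of part b), done as a sketch assuming scipy; the data are Seymour’s, so your seventh value in office 2 will change the answer slightly.

from scipy.stats import f_oneway

office1 = [8, 2, 9, 8, 3, 10, 9, 6, 8]
office2 = [7, 6, 10, 3, 9, 10, 5]
office3 = [7, 5, 5, 5, 4, 3, 5, 5, 3]
office4 = [5, 3, 6, 9, 6, 5, 5, 6, 3]
office5 = [6, 6, 6, 6, 3, 4, 8, 6, 2]

F, p = f_oneway(office1, office2, office3, office4, office5)
print(f"F = {F:.2f}, p = {p:.3f}")
# F comes out near 2.3 with p above .05, so we do not reject equal means, as above.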
c) The two useful intervals would be the Scheffe’ interval
μ₁ − μ₂ = (x̄₁ − x̄₂) ± √[(m − 1)F_α(m−1, n−m)] s √(1/n₁ + 1/n₂)
and the Tukey interval
μ₁ − μ₂ = (x̄₁ − x̄₂) ± q_α(m, n−m) s √[½(1/n₁ + 1/n₂)].
For Scheffe’ the error part of the interval is
√[(m − 1)F(m−1, n−m) s²] √(1/n₁ + 1/n₂) = √[4(2.62)(4.496)] √(1/9 + 1/9) = 6.864(0.4714) = 3.236, or
√[4(2.62)(4.496)] √(1/9 + 1/7) = 6.864(0.5040) = 3.459 for intervals with x̄₂ in them.
For Tukey the error part of the interval is q(5, 38) √(4.496/2) √[½(1/n₁ + 1/n₂)], which is about
4.05 √(4.496/2) √[½(1/9 + 1/9)] = 6.07 √(1/9) = 2.02, or 4.05 √(4.496/2) √[½(1/9 + 1/7)] = 6.07 √(8/63) = 2.16
for intervals with x̄₂ in them. So let’s look at the differences (an s marks a difference larger than its Tukey error term).

Pair    Difference                  Pair    Difference
1-2     7 − 7.143 = −0.143          2-4     7.143 − 5.333 = 1.810
1-3     7 − 4.667 = 2.333 s         2-5     7.143 − 5.222 = 1.921
1-4     7 − 5.333 = 1.667           3-4     4.667 − 5.333 = −0.666
1-5     7 − 5.222 = 1.778           3-5     4.667 − 5.222 = −0.555
2-3     7.143 − 4.667 = 2.476 s     4-5     5.333 − 5.222 = 0.111
Obviously, none of the differences are as large as the error terms by the Scheffe’ criterion, so none of them
are significant by this criterion and this small sample gives us no evidence of differences between English
and non-English speaking offices. The less conservative Tukey differences show the third office to be less
happy with the program than the first or second.
d) This is real work. Seymour has the following.

Row        1      2      3      4      5     Sum   Count    x̄ᵢ.      Σx²     x̄ᵢ.²
 1         8      7      7      5      6      33     5      6.6      223    43.56
 2         2      6      5      3      6      22     5      4.4      110    19.36
 3         9     10      5      6      6      36     5      7.2      278    51.84
 4         8      3      5      9      6      31     5      6.2      215    38.44
 5         3      9      4      6      3      25     5      5.0      151    25.00
 6        10     10      3      5      4      32     5      6.4      250    40.96
 7         9      5      5      5      8      32     5      6.4      220    40.96
 8         6     10      5      6      6      33     5      6.6      233    43.56
 9         8     10      3      3      2      26     5      5.2      186    27.04
Sum       63     70     42     48     47     270    45      6.0     1866   330.72
Count      9      9      9      9      9
x̄.ⱼ        7.000  7.778  4.667  5.333  5.222
Σx²      503    600    208    282    273
x̄.ⱼ²      49     60.494 21.778 28.444 27.272   (sum = 186.99)

From the above, Σx = 270, n = 45, x̄ = 270/45 = 6.00, Σx²ᵢⱼ = 1866, Σ x̄ᵢ.² = 330.72 and Σ x̄.ⱼ² = 186.99.
SST = Σx²ᵢⱼ − nx̄² = 1866 − 45(6.00)² = 1866 − 1620 = 246
SSC = R Σ x̄.ⱼ² − nx̄² = 9(186.99) − 45(6.00)² = 1682.91 − 1620 = 62.91
SSR = C Σ x̄ᵢ.² − nx̄² = 5(330.72) − 45(6.00)² = 1653.60 − 1620 = 33.60
SSW = SST − SSC − SSR = 246 − 62.91 − 33.60 = 149.49
Source        SS        DF       MS         F
Rows         33.60       8       4.20       0.90 ns
Columns      62.91       4      15.7275     3.37 s
Within      149.49      32       4.6716
Total       246.00      44

The table values are F.05(8,32) = 2.29 and F.05(4,32) = 3.32. Since the computed F for rows (0.90) is below F.05(8,32) = 2.29, we do not reject H₀₁, which is ‘no difference between individual (row) means.’ Since the computed F for columns (3.37) is above F.05(4,32) = 3.32, we reject H₀₂, which is ‘no difference between office (column) means.’
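The two-way table in d) can also be checked by computer. Here is a sketch using pandas and statsmodels (both assumed installed), with rows treated as job descriptions and offices as the second factor, one observation per cell; the data are Seymour’s.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = {1: [8, 2, 9, 8, 3, 10, 9, 6, 8],
        2: [7, 6, 10, 3, 9, 10, 5, 10, 10],   # Seymour's filled-in column 2
        3: [7, 5, 5, 5, 4, 3, 5, 5, 3],
        4: [5, 3, 6, 9, 6, 5, 5, 6, 3],
        5: [6, 6, 6, 6, 3, 4, 8, 6, 2]}
long_data = pd.DataFrame([(job + 1, office, x)
                          for office, column in data.items()
                          for job, x in enumerate(column)],
                         columns=["job", "office", "rating"])
model = ols("rating ~ C(job) + C(office)", data=long_data).fit()
print(sm.stats.anova_lm(model, typ=2))
# The sums of squares come out near 33.6 (jobs), 62.9 (offices) and 149.5 (error).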
e) If each column is a random sample from a non-normal distribution, use a Kruskal-Wallis test. We only need a Friedman test if the data is cross-classified. The original data and its ranks are shown below. The ranking should go from 1 to 43, but because there are so many ties, each number represents an average rank.
Column     x1     r1      x2     r2      x3     r3      x4     r4      x5     r5
            8    34.5      7    31.5      7    31.5      5    16.0      6    25.5
            2     1.5      6    25.5      5    16.0      3     6.0      6    25.5
            9    38.5     10    42.0      5    16.0      6    25.5      6    25.5
            8    34.5      3     6.0      5    16.0      9    38.5      6    25.5
            3     6.0      9    38.5      4    10.5      6    25.5      3     6.0
           10    42.0     10    42.0      3     6.0      5    16.0      4    10.5
            9    38.5      5    16.0      5    16.0      5    16.0      8    34.5
            6    25.5                     5    16.0      6    25.5      6    25.5
            8    34.5                     3     6.0      3     6.0      2     1.5
Sum             255.5           201.5          134.0          175.0          180.0
To check the ranking, note that the sum of the five rank sums is 255.5 + 201.5 + 134.0 + 175.0 + 180.0 = 946.0, and that the sum of the first n ranks is n(n + 1)/2 = 43(44)/2 = 946.
Now compute the Kruskal-Wallis statistic
H = [12/(n(n + 1))] Σᵢ (SRᵢ²/nᵢ) − 3(n + 1)
  = [12/(43(44))](255.5²/9 + 201.5²/7 + 134.0²/9 + 175.0²/9 + 180.0²/9) − 3(44)
  = 0.006342(22051.57) − 132 = 7.862.
Since the size of the problem is larger than those shown in the Kruskal-Wallis table, use the χ² distribution with df = m − 1 = 4, where m is the number of columns. Compare H with χ².05(4) = 9.4877. Since H is smaller than χ².05(4), do not reject the null hypothesis.
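The same test by computer, as a sketch assuming scipy. Note that scipy corrects H for ties, so its value comes out a little above the 7.862 found by hand, with the same conclusion.

from scipy.stats import kruskal

office1 = [8, 2, 9, 8, 3, 10, 9, 6, 8]
office2 = [7, 6, 10, 3, 9, 10, 5]
office3 = [7, 5, 5, 5, 4, 3, 5, 5, 3]
office4 = [5, 3, 6, 9, 6, 5, 5, 6, 3]
office5 = [6, 6, 6, 6, 3, 4, 8, 6, 2]

H, p = kruskal(office1, office2, office3, office4, office5)
print(f"H = {H:.2f}, p = {p:.3f}")
# H stays below the chi-squared table value of 9.49 (df = 4), so do not reject H0.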
f) As I threatened, I ran this on the computer. The data in c1-c5 was stacked into c11, with office numbers in c12. The ‘Stat’ pulldown menu was chosen and ANOVA picked, followed by ‘test for equal variances.’ Output follows.
MTB > %Vartest c11 c12;
SUBC>   Confidence 95.0.
Executing from file: W:\wminitab13\MACROS\Vartest.MAC
Macro is running ... please wait
Test for Equal Variances

Response    C11
Factors     C12
ConfLvl     95.0000
Bonferroni confidence intervals for standard deviations (Comment: These are apparently intervals of the type
(n − 1)s²/χ²_{α/2k} ≤ σ² ≤ (n − 1)s²/χ²_{1−α/2k}.
Note, for example, that there are k = 5 intervals and that for the first factor level 1.68047² = 8(2.78388)²/χ².005(8), where .005 = α/(2k).)
Lower       Sigma       Upper      N   Factor Levels
1.68047     2.78388     6.79093    9   1
1.52009     2.67261     7.96390    7   2
0.73931     1.22474     2.98761    9   3
1.08823     1.80278     4.39765    9   4
1.12031     1.85592     4.52729    9   5
Bartlett's Test (normal distribution) (Comment: This was explained in the new outline pages and has a null
hypothesis of equal variances, which, because of the high p-value, we do not reject.)
Test Statistic: 5.961
P-Value        : 0.202
Levene's Test (any continuous distribution) (Comment: This was the only part that you were supposed to
worry about. This was explained in the new outline pages and has a null hypothesis of equal variances, which, because of the high
p-value, we do not reject.)
Test Statistic: 1.056
P-Value        : 0.391
[Graph omitted: "Test for Equal Variances for C11" showing the 95% confidence intervals for sigmas for factor levels 1-5, annotated with the Bartlett's and Levene's test statistics and p-values given above.]
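The same two variance tests can be run without Minitab; here is a sketch with scipy (center="median" matches the median-based Levene test that Minitab reports).

from scipy.stats import bartlett, levene

office1 = [8, 2, 9, 8, 3, 10, 9, 6, 8]
office2 = [7, 6, 10, 3, 9, 10, 5]
office3 = [7, 5, 5, 5, 4, 3, 5, 5, 3]
office4 = [5, 3, 6, 9, 6, 5, 5, 6, 3]
office5 = [6, 6, 6, 6, 3, 4, 8, 6, 2]

print(bartlett(office1, office2, office3, office4, office5))
print(levene(office1, office2, office3, office4, office5, center="median"))
# Both null hypotheses are 'equal variances'; the large p-values mean neither is rejected.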
2. (Keller, Warrack) A dealer records the odometer reading and selling price in thousands of a sample of 100 3-year old Ford
Tauruses (well equipped and in excellent condition) sold at auction. Unfortunately, he missed one car in his initial computations. The
101st car has an odometer reading of 16.000 (in thousands)
and sold at 14.800 plus the last three digits of your student number divided by 1000. For example, Seymour Butz’s student number is
976500, so he thinks the car sold at $14.800 + $0.500 = 15.300 (thousands). The column sums are given below without the 101st car,
so you should find it easy to adjust these sums for the 101st car.
Row      Odometer     Price
  1       37.388     14.646
  2       44.758     14.122
  3       45.833     14.016
  .         ..         ..
 98       33.190     14.518
 99       36.196     14.712
100       36.392     14.266

          sumy       sumx      smxsq      smxy      smysq
        1482.28    3600.95    133977    53107.6    21997.3

Note that these sums can’t be used directly, but they should help you to get the corrected numbers.
‘Price’ is the dependent variable and ‘Odometer’ is the independent variable. If you don’t know what that means, don’t do the
problem until you find out. Show your work – it is legitimate to check your results by running the problem on the computer, but I
expect to see hand computations that show clearly where you got your numbers for every part of this problem.
a. Compute the regression equation Ŷ = b₀ + b₁x to predict the ‘Price’ on the basis of ‘Odometer’. (2)
b. How much does the equation predict that a car with an odometer reading of 35000 miles will sell for? If the answer isn’t
reasonable compared to the prices shown above, find your mistake and fix it. (1)
c. Compute R². (2)
d. Compute sₑ. (2)
e. Compute s_b0 and do a significance test on b₀. (1.5)
f. Compute s_b1 and do a confidence interval for b₁. (1.5)
g. Do an ANOVA table for the regression. What conclusion can you draw from this table about the relationship between
odometer reading and price? Why? (2)
h. Do a prediction interval for the selling price of the car in b. Explain the difference between this and a confidence interval and
why this is the appropriate interval to use here. (3) [73]
Solution: Seymour’s data is modified as below.

sumy      1482     + 15.3     = 1498
sumx      3601     + 16       = 3617
smxsq     133977   + 256      = 134233
sumxy     53108    + 244.8    = 53352
smysq     21997    + 234.09   = 22231
n         100      + 1        = 101

I find it hard to believe that so many people got this wrong. If sumxy = Σxy, then you should have added x₁₀₁y₁₀₁ to it, not x₁₀₁ + y₁₀₁. Even if you didn’t know this, and you only had 20 or 30 examples, you should have realized something was wrong when R² came out above 1 or SSE came out negative. These are ‘unreasonable’ answers and it would at least have been wise to cover your tails by admitting it.

So n = 101, Σx = 3617, Σy = 1498, Σx² = 134233, Σxy = 53352 and Σy² = 22231.
Spare Parts Computation:
x̄ = Σx/n = 3617/101 = 35.8113
ȳ = Σy/n = 1498/101 = 14.8275
SSx = Σx² − nx̄² = 134233 − 101(35.8113)² = 4705.75*
Sxy = Σxy − nx̄ȳ = 53352 − 101(35.8113)(14.8275) = −277.992
SSy = Σy² − nȳ² = 22231 − 101(14.8275)² = 25.9650 = SST*
a) b₁ = Sxy/SSx = (Σxy − nx̄ȳ)/(Σx² − nx̄²) = −277.992/4705.75 = −0.05907. Since price falls as the odometer reading rises, b₁ < 0.
b₀ = ȳ − b₁x̄ = 14.8275 − (−0.05907)(35.8113) = 16.9431.
So Ŷ = b₀ + b₁x becomes Ŷ = 16.9431 − 0.05907x.
b) Ŷ = 16.9431 − 0.05907(35) = 14.876, a price that looks very much like the ones in the original data.
c) SSR = b₁Sxy = (−0.05907)(−277.992) = 16.4224*.
R² = SSR/SST = 16.4224/25.9650 = .6324, or equivalently R² = Sxy²/(SSx·SSy) = (−277.992)²/[(4705.75)(25.9650)], which gives the same value. This must be between zero and one.
d) SSE = SST − SSR = 25.9650 − 16.4224 = 9.5426*.
sₑ² = SSE/(n − 2) = 9.5426/99 = 0.09640 (sₑ² is always positive!), so sₑ = √0.09640 = 0.3105.
e) For the remainder of this problem, t.025(n − 2) = t.025(99) = 1.984. We are testing H₀: β₀ = 0 against H₁: β₀ ≠ 0.
s²_b0 = sₑ²[1/n + x̄²/SSx] = 0.09640[1/101 + (35.8113)²/4705.75] = 0.09640[0.009901 + 0.272528] = 0.02723*
s_b0 = √0.02723 = 0.16500*, and t = b₀/s_b0 = 16.9431/0.16500 = 102.69.
Make a diagram of an almost Normal curve with zero in the middle; if α = .05, the rejection zones are above t.025(99) = 1.984 and below −t.025(99) = −1.984. Since our computed t-ratio, 102.69, is not between the two critical values, we reject the null hypothesis that β₀ = 0 and can say that b₀ is significant.
f) s²_b1 = sₑ²[1/SSx] = 0.09640[1/4705.75] = 0.00002049*, so s_b1 = √0.00002049 = 0.004526*.
β₁ = b₁ ± t.025(99) s_b1 = −0.05907 ± 1.984(0.004526) = −0.060 ± 0.009. Note that the error part of the interval is smaller in absolute value than the slope, so the slope is significantly different from zero.
g) Note that the F test below shows the same information as in f); in fact, the F in the table is just the square of the t-ratio for the slope. Because the computed F is larger than the table F, we reject the null hypothesis of no relationship between odometer reading and price.

Source              SS         DF       MS          F            F.05
Regression         16.4224*     1     16.4224     170.375 s    F.05(1,99) = 3.94
Error (Within)      9.5426*    99      0.09639
Total              25.9650*   100
* These quantities must be positive.
h) s²_Ŷ = sₑ²[1/n + (X₀ − x̄)²/SSx + 1] = 0.09640[1/101 + (35 − 35.8113)²/4705.75 + 1]
= 0.09640[0.009901 + 0.000140 + 1] = 0.09640(1.01004) = 0.0974*
s_Ŷ = √0.0974 = 0.3120*, and Y₀ = Ŷ₀ ± t.025(99) s_Ŷ = 14.876 ± 1.984(0.3120) = 14.876 ± 0.619. A prediction interval is appropriate when we are concerned about the price of one car, rather than the average price of cars with the same odometer reading.
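Every number in this problem follows from the six adjusted sums, so the whole solution can be replayed in a few lines of Python. This is a sketch using Seymour’s sums; replace sumy, sumxy and smysq with your own adjusted values.

from math import sqrt
from scipy.stats import t

n = 101
sum_x, sum_y = 3617.0, 1498.0
sum_xsq, sum_xy, sum_ysq = 134233.0, 53352.0, 22231.0

xbar, ybar = sum_x / n, sum_y / n
ss_x = sum_xsq - n * xbar ** 2               # about 4705.75
s_xy = sum_xy - n * xbar * ybar              # about -277.99
sst = sum_ysq - n * ybar ** 2                # about 25.97

b1 = s_xy / ss_x                             # slope, about -0.059
b0 = ybar - b1 * xbar                        # intercept, about 16.94
ssr = b1 * s_xy
sse = sst - ssr
r_sq = ssr / sst                             # about .63
s_e = sqrt(sse / (n - 2))                    # about 0.31

tcrit = t.ppf(0.975, n - 2)                  # about 1.984
s_b1 = s_e / sqrt(ss_x)
x0 = 35                                      # odometer reading in thousands
y_hat = b0 + b1 * x0
s_pred = s_e * sqrt(1 / n + (x0 - xbar) ** 2 / ss_x + 1)
print(f"Yhat = {b0:.4f} + ({b1:.5f})x,  R^2 = {r_sq:.4f}, s_e = {s_e:.4f}")
print(f"slope: {b1:.5f} +/- {tcrit * s_b1:.5f}")
print(f"prediction at x = 35: {y_hat:.3f} +/- {tcrit * s_pred:.3f}")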