252x0644 12/14/05 Name and Class hour:_________________________

ECO252 QBA2
Final EXAM
December , 2006
Version 4
Name and Class hour:_________________________
I. (18+ points) Do all the following. Note that answers without reasons and citation of appropriate
statistical tests receive no credit. Most answers require a statistical test, that is, stating or implying a
hypothesis and showing why it is true or false by citing a table value or a p-value. If you haven’t done it
lately, take a fast look at ECO 252 - Things That You Should Never Do on a Statistics Exam (or Anywhere
Else)
Regression B seeks to explain the selling price of a home in terms of a group of variables explained on the
output sheet. Note that regressions 12 and 17 are identical. Look at the definitions of the variables carefully
and, in particular, notice which are interaction variables.
a) The homes in this regression have been rated High, Med or Low by realtors. There are dummy variables
to indicate the ratings. Why didn’t I use High or AH in regression 12? (1)
b) In Regression 12, what coefficients are significant at the 1% level? (2)
c) What independent variables did I remove from the problem to get to Regression 13 from Regression 12?
Why? (2)
d) Following the same process, I went on to remove one or more variables to get to Regression 13. When I
got to Regression 13 I ran the ‘best subsets regression,’ Regression 14. I concluded that it was time to quit removing
variables. Between the best subsets regression and the characteristics of the coefficients of the results in
Regression 13 I felt that I had gone as far as was reasonable in removing independent variables. What are
the three things that led me to think that regression 5 was almost the best that I could do? Remember that a
close relationship between Sq.ft and Sqftsq is excusable. What in the printout might make you question my
judgment? (3)
e) Using Regression 13 and assuming that all homes have areas of 1000 sq. ft., Regression 13 effectively
becomes 3 regressions relating Market price to Assessment. Take the coefficient of Sq.ft multiplied by
1000 and the coefficient of Sqftsq multiplied by 1000^2. Add them to the constant to get the effective
intercept for homes with areas of 1000 sq. ft. Using A or any other symbol that you find convenient for
living area, what are the equations relating assessment to Market price for (4 points)
Low homes?
Med homes?
High homes ?
Is the difference between the slopes of these three equations relative to market significant? Why? [12]
f) Continuing with Regression 13 and assuming that a home has 1000 square feet of living area and an
assessment of 24, what would it sell for if it were rated
Low?
Med?
High?
What is the percent difference between the lowest and highest price? (2)
g) We have not yet dealt with the question of whether the coefficients in Regression 5 are reasonable. In
order to do this look at two homes, one with an area of 1000 and the second with an area of 1001. By how
much will their Market prices differ? Does that seem reasonable? (3)
[17]
h) As I warned you, I now repeated Regression 12 as Regression 15, without using the VIFs. I decided to
drop 1 variable. Why? (1)
i) I could now add AH to the independent variables and did equation 16. I dropped it immediately. Why?
(1)
j) I now ran Regression 17 with one fewer independent variable than Regression 15 and did the same
thing to get to Regression 18. How does Regression 18 compare with Regression 13? (2)
j) Regression 17 is a stepwise regression. The printout presents four different possible regressions in
column form. Note that in each case a coefficient has a t-value under it and a p-value for a significance test.
After the fourth try, the computer refused to add any more independent variables. The only regression here
that I thought was worth looking at was the one with four independent variables. What can you tell me
about its acceptability?
(3)
[24]
k) Do an F test to compare regressions 15 and 18 and to see if the two variables removed had any
explanatory power.
II. Hand in your third computer problem. (2 to 7 points)
III. Do at least 4 of the following 7 problems (at least 12 points each), or do sections adding to at least 50 points.
(Anything extra you do helps, and grades wrap around.) You must do parts a) and b) of problem 1. Show
your work! State H0 and H1 where applicable. Use a significance level of 5% unless noted otherwise.
Do not answer questions without citing appropriate statistical tests – That is, explain your hypotheses
and what values from what table were used to test them. Clearly label what section of each problem
you are doing! The entire test has about 151 points, but 70 is considered a perfect score. Don’t waste
our time by telling me that two means, proportions, variances or medians don’t look the same to you.
You need statistical tests! There are two blank pages below.
1. a) If I want to test to see if the mean of x2 is smaller than the mean of x1, my null hypotheses are:
(Note: D = μ1 - μ2)
i) 1   2 and D  0
ii) 1   2 and D  0
v) 1   2 and D  0
vi) 1   2 and D  0
iii) 1   2 and D  0
vii) 1   2 and D  0
iv) 1   2 and D  0
viii) 1   2 and D  0 (2)
The first two columns below represent times for 25 workers on an industrial task. The third column is the
difference between them
Row    x1     x2      d
  1   5.11   4.81   0.30
  2   4.13   4.19  -0.06
  3   5.42   5.17   0.25
  4   3.65   4.07  -0.42
  5   4.82   4.58   0.24
  6   3.08   2.97   0.11
  7   3.01   3.39  -0.38
  8   4.26   4.14   0.12
  9   4.25   4.31  -0.06
 10   6.66   6.68  -0.02
 11   5.29   5.37  -0.08
 12   4.41   3.95   0.46
 13   5.17   4.93   0.24
 14   4.50   4.04   0.46
 15   3.06   2.40   0.66
 16   5.19   4.71   0.48
 17   5.71   5.93  -0.22
 18   3.41   2.93   0.48
 19   4.25   4.25   0.00
 20   3.85   4.41  -0.56
 21   5.50   4.68   0.82
 22   4.24   3.50   0.74
 23   6.29   6.09   0.20
 24   3.99   2.87   1.12
 25   3.26   3.06   0.20
Assume that α = .05. Minitab gives us the following summary (edited).
Descriptive Statistics: x1, x2, d
Variable   N  N*  Mean  SE Mean    StDev  Minimum       Q1  Median      Q3  Maximum
x1        25   0  4.50    0.200    1.001    3.010    3.750   4.260   5.240    6.660
x2        25   0  4.30    0.212    1.062    2.400    3.445   4.250   4.870    6.680
d         25   0  0.20  ……………  ……………  -0.5600  -0.0600  0.2000  0.4700   1.1200
In the d column, the column sum is 5.08 and the sum of the first 24 numbers squared is 4.825. Do not
recompute things that have been done for you if you want to ever get much done on this exam.
Clearly label parts b, c, d etc. The null hypothesis is the same for parts c, d and e, so state it clearly.
b). Find the sample variance for the d column. (2)
c) On the assumption that the underlying distributions are Normal and that the first two columns represent
independent samples from populations that represent plants 1 and 2 and come from populations with similar
variances, can we conclude that average workers in plant 2 complete the task faster than those in plant 1?
(4)
d) (Extra credit) Repeat part c) after dropping the assumption that the variances are similar. (5)
e) Actually, these data supposedly represent performance of a single sample of 25 workers on two
administrations of a standard test of manual dexterity. The question was ‘Did the time for the test improve
between the first and second administration?’ (3)
[11]
f) Assume that the means above come from independent samples, but that the data represent samples for
populations with known population variances of 1.00 and 1.06. Test the null hypothesis that you used in
part c) and find an exact p-value. (3)
[14]
g) Using the value of s_d that you used in e), make a confidence interval with a confidence level of 94%.
You must find the value of z_(α/2) needed to do this first. Of course, it is not on the t-table. (2) [16]
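The shortcut behind part b) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the column sum and the partial sum of squares quoted above, reads the 25th difference (0.20) from the data table, and goes on to form the paired t ratio that part e) would compare with a table value. The variable names are made up for the sketch.

```python
import math

# Values quoted in the exam text for the d column (25 paired differences).
n = 25
sum_d = 5.08               # column sum of d
sum_sq_first_24 = 4.825    # sum of squares of the first 24 differences
d_25 = 0.20                # 25th difference, read from the data table

sum_sq = sum_sq_first_24 + d_25 ** 2
mean_d = sum_d / n

# Computational formula: s^2 = (sum of d^2 - n * mean^2) / (n - 1)
var_d = (sum_sq - n * mean_d ** 2) / (n - 1)
sd_d = math.sqrt(var_d)

# Standard error of the mean difference and the paired t ratio (n - 1 df).
se_mean_d = sd_d / math.sqrt(n)
t_stat = mean_d / se_mean_d

print(f"s_d^2 = {var_d:.4f}, s_d = {sd_d:.4f}, t = {t_stat:.3f}")
```

The ratio still has to be compared with a one-sided t-table value for 24 degrees of freedom; the sketch does not make that decision.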
2. Let us expand the problem of question 1 by adding another column. The full data set with lots done for
you looks like this. The first three columns represent the given data. In the next three columns I have taken
the first three columns and squared them. I have added the first three columns to get the seventh column. I have
computed row means in the 8th column and squared them in the 9th. The tenth column is a row sum of squares. In the 11th to the 13th
columns the numbers in the first three columns are ranked from 1 to 75. Sums are provided for all 13
columns. You
       (1)   (2)   (3)      (4)      (5)      (6)    (7)      (8)      (9)     (10)   (11)   (12)   (13)
Row     x1    x2    x3     x1sq     x2sq     x3sq   rsum    rmean     rmsq     rssq  rank1  rank2  rank3
  1   5.11  4.81  4.96  26.1121  23.1361  24.6016  14.88  4.96000  24.6016   73.850   57.0   50.0   54.0
  2   4.13  4.19  4.16  17.0569  17.5561  17.3056  12.48  4.16000  17.3056   51.919   27.5   32.0   30.0
  3   5.42  5.17  5.29  29.3764  26.7289  27.9841  15.88  5.29333  28.0194   84.089   65.0   58.5   61.5
  4   3.65  4.07  3.86  13.3225  16.5649  14.8996  11.58  3.86000  14.8996   44.787   19.0   26.0   21.0
  5   4.82  4.58  4.70  23.2324  20.9764  22.0900  14.10  4.70000  22.0900   66.299   51.0   46.0   48.0
  6   3.08  2.97  3.02   9.4864   8.8209   9.1204   9.07  3.02333   9.1405   27.428   10.0    5.0    7.0
  7   3.01  3.39  3.20   9.0601  11.4921  10.2400   9.60  3.20000  10.2400   30.792    6.0   15.0   13.0
  8   4.26  4.14  4.20  18.1476  17.1396  17.6400  12.60  4.20000  17.6400   52.927   39.0   29.0   33.0
  9   4.25  4.31  4.28  18.0625  18.5761  18.3184  12.84  4.28000  18.3184   54.957   36.5   42.0   41.0
 10   6.66  6.68  6.67  44.3556  44.6224  44.4889  20.01  6.67000  44.4889  133.467   73.0   75.0   74.0
 11   5.29  5.37  5.33  27.9841  28.8369  28.4089  15.99  5.33000  28.4089   85.230   61.5   64.0   63.0
 12   4.41  3.95  4.18  19.4481  15.6025  17.4724  12.54  4.18000  17.4724   52.523   43.5   23.0   31.0
 13   5.17  4.93  5.05  26.7289  24.3049  25.5025  15.15  5.05000  25.5025   76.536   58.5   52.0   55.0
 14   4.50  4.04  4.27  20.2500  16.3216  18.2329  12.81  4.27000  18.2329   54.805   45.0   25.0   40.0
 15   3.06  2.40  2.73   9.3636   5.7600   7.4529   8.19  2.73000   7.4529   22.577    8.5    1.0    2.0
 16   5.19  4.71  4.95  26.9361  22.1841  24.5025  14.85  4.95000  24.5025   73.623   60.0   49.0   53.0
 17   5.71  5.93  5.82  32.6041  35.1649  33.8724  17.46  5.82000  33.8724  101.641   67.0   69.0   68.0
 18   3.41  2.93  3.17  11.6281   8.5849  10.0489   9.51  3.17000  10.0489   30.262   16.0    4.0   12.0
 19   4.25  4.25  4.25  18.0625  18.0625  18.0625  12.75  4.25000  18.0625   54.188   36.5   36.5   36.5
 20   3.85  4.41  4.13  14.8225  19.4481  17.0569  12.39  4.13000  17.0569   51.327   20.0   43.5   27.5
 21   5.50  4.68  5.09  30.2500  21.9024  25.9081  15.27  5.09000  25.9081   78.061   66.0   47.0   56.0
 22   4.24  3.50  3.87  17.9776  12.2500  14.9769  11.61  3.87000  14.9769   45.205   34.0   18.0   22.0
 23   6.29  6.09  6.19  39.5641  37.0881  38.3161  18.57  6.19000  38.3161  114.968   72.0   70.0   71.0
 24   3.99  2.87  3.43  15.9201   8.2369  11.7649  10.29  3.43000  11.7649   35.922   24.0    3.0   17.0
 25   3.26  3.06  3.16  10.6276   9.3636   9.9856   9.48  3.16000   9.9856   29.977   14.0    8.5   11.0
The sums of the columns will not fit on the table so they are printed here.
(1)Sum of x1 = 112.51; (2)Sum of x2 = 107.43; (3)Sum of x3 = 109.96;
(4)Sum of x1sq = 530.380; (5)Sum of x2sq = 488.725; (6)Sum of x3sq = 508.253;
(7)Sum of rsum = 329.9; (8)Sum of rmean = 109.967; (9)Sum of rmsq = 508.308;
(10)Sum of rssq = 1527.36;
(11)Sum of rank1 = 1010.5; (12)Sum of rank2 = 892; (13)Sum of rank3 = 947.5.
You are left to find column means and the grand mean. Please avoid recomputing stuff that I have done
for you. Life is not that long. You will need to get column and overall means. Almost everything else is
done for you.
a) Consider the first three columns to be three independent random samples from Normal distributions with
similar variances. Compare the means using an appropriate statistical test or tests. (6)
b) Actually as in 1e) these data represent three tests of a single random sample of 25 workers. Consider the
data blocked by worker and compare means. (4)
c) Consider the first three columns to be three independent random samples from a distribution that is not
Normal. Compare the medians using an appropriate statistical test or tests. (5) [31]
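For part a), the one-way ANOVA sums of squares can be assembled from the printed column sums alone. The sketch below is a hedged illustration, not the outline's prescribed layout: it assumes the usual computational formulas (total minus correction term, between-column from squared column sums) and uses only the totals quoted above. All variable names are made up for the sketch.

```python
# Column sums and sums of squares quoted in the exam text.
n_per_col = 25
col_sums = [112.51, 107.43, 109.96]        # sums (1)-(3)
col_sums_sq = [530.380, 488.725, 508.253]  # sums (4)-(6)

k = len(col_sums)
N = k * n_per_col
grand_sum = sum(col_sums)
grand_mean = grand_sum / N
correction = grand_sum ** 2 / N            # equals N * (grand mean)^2

ss_total = sum(col_sums_sq) - correction
ss_between = sum(s ** 2 / n_per_col for s in col_sums) - correction
ss_within = ss_total - ss_between

df_between, df_within = k - 1, N - k
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"grand mean = {grand_mean:.4f}")
print(f"SSB = {ss_between:.4f}, SSW = {ss_within:.4f}, SST = {ss_total:.4f}")
print(f"F({df_between}, {df_within}) = {f_stat:.4f}")
```

The F ratio is then compared with the table value for 2 and 72 degrees of freedom. The blocked version in part b) additionally removes a between-rows sum of squares, which can be built from the rmsq column in the same way.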
3. A sales manager wishes to predict newspaper circulation on Sunday (y) on the basis of weekday
morning circulation (x1), weekday evening circulation (x2) and time (x3). The data is below. (Use
α = .01.) All circulation data is in millions sold.
        y    x1    x2    x3
Row     S    AM    PM     T
  1    54    32    34     0
  2    55    35    33     1
  3    55    36    31     2
  4    56    40    29     3
  5    56    41    29     4
  6    58    42    28     5
  7    58    43    26     6
  8    60    44    25     7
  9    61    47    24     8
The quantities below are given:
n = 9, Σy = 513, Σx1 = 360, Σx2 = 259,
Σx1^2 = 14584, Σx2^2 = 7549, Σy^2 = 29287,
Σx1x2 = 10230, Σx2y = 14700, Σx1y = ?
Yes, you will have to compute Σx1y as part of a).
You do not need all of these.
a) Compute a simple regression of Sunday circulation against morning circulation.(8)
b) Compute R^2. (4)
c) Compute s_e. (3)
d) Compute s_b0 (the std deviation of the intercept) and do a confidence interval for β0. (3)
e) Do a prediction interval for units when morning circulation rises to 50 million. (3) Why is this interval
likely to be larger than other prediction intervals we might compute for morning circulation we have
actually observed? (1)
[53]
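The arithmetic for parts a) to c) can be checked with a short Python sketch. It is an illustration under stated assumptions: it uses the standard "spare parts" formulas (Sxy, Sxx, Syy) computed from the data table, and the variable names are made up for the sketch.

```python
import math

# Data from the table above.
y = [54, 55, 55, 56, 56, 58, 58, 60, 61]    # Sunday circulation (S)
x1 = [32, 35, 36, 40, 41, 42, 43, 44, 47]   # weekday morning circulation (AM)

n = len(y)
sum_x, sum_y = sum(x1), sum(y)
sum_xy = sum(a * b for a, b in zip(x1, y))  # the sum the exam leaves as "?"
sum_xx = sum(a * a for a in x1)
sum_yy = sum(b * b for b in y)

x_bar, y_bar = sum_x / n, sum_y / n

# "Spare parts": corrected sums of squares and cross products.
s_xy = sum_xy - n * x_bar * y_bar
s_xx = sum_xx - n * x_bar ** 2
s_yy = sum_yy - n * y_bar ** 2

b1 = s_xy / s_xx                  # slope
b0 = y_bar - b1 * x_bar           # intercept
r_squared = s_xy ** 2 / (s_xx * s_yy)
s_e = math.sqrt((s_yy - b1 * s_xy) / (n - 2))   # standard error of estimate

print(f"S = {b0:.3f} + {b1:.4f} AM, R^2 = {r_squared:.4f}, s_e = {s_e:.4f}")
```

The same spare parts feed parts d) and e), where s_e is combined with s_xx and the distance of the chosen AM value from its mean.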
4. Data from problem 3 is repeated. (Use α = .01.)
A sales manager wishes to predict newspaper circulation on Sunday (y) on the basis of weekday morning
circulation (x1), weekday evening circulation (x2) and time (x3). The data is below. All
circulation data is in millions sold.
        y    x1    x2    x3
Row     S    AM    PM     T
  1    54    32    34     0
  2    55    35    33     1
  3    55    36    31     2
  4    56    40    29     3
  5    56    41    29     4
  6    58    42    28     5
  7    58    43    26     6
  8    60    44    25     7
  9    61    47    24     8
The quantities below are given:
n = 9, Σy = 513, Σx1 = 360, Σx2 = 259,
Σx1^2 = 14584, Σx2^2 = 7549, Σy^2 = 29287,
Σx1x2 = 10230, Σx2y = 14700, Σx1y = ?
Yes, you will have to compute Σx1y as part of a).
a) Do a multiple regression of Sunday circulation against morning and evening circulation. (12)
b) Compute R^2 and R^2 adjusted for degrees of freedom for both this and the previous problem. Compare
the values of R^2 adjusted between this and the previous problem. Use an F test to compare R^2 here with
the R^2 from the previous problem. (6)
c) Compute the regression sum of squares and use it in an F test to test the usefulness of this regression. (5)
d) Use your regression to predict the number of units sold when AM circulation is 40 and PM circulation is
25.(2)
e) Use the directions in the outline to make this estimate into a confidence interval and a prediction interval.
(4)
[82]
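A two-variable regression can be solved by hand from the corrected sums of squares and cross products, and the sketch below mirrors that calculation. It is a hedged illustration: it assumes the simplified normal equations in deviation form, solved by Cramer's rule, which may differ in layout from the method in the course outline. The helper `centered` and all variable names are hypothetical.

```python
# Data from the table above.
s = [54, 55, 55, 56, 56, 58, 58, 60, 61]    # y: Sunday circulation
am = [32, 35, 36, 40, 41, 42, 43, 44, 47]   # x1: weekday morning circulation
pm = [34, 33, 31, 29, 29, 28, 26, 25, 24]   # x2: weekday evening circulation

n, k = len(s), 2
y_bar, x1_bar, x2_bar = sum(s) / n, sum(am) / n, sum(pm) / n

def centered(u, v, u_bar, v_bar):
    """Corrected sum of cross products: sum(u*v) - n * u_bar * v_bar."""
    return sum(a * b for a, b in zip(u, v)) - n * u_bar * v_bar

s11 = centered(am, am, x1_bar, x1_bar)
s22 = centered(pm, pm, x2_bar, x2_bar)
s12 = centered(am, pm, x1_bar, x2_bar)
s1y = centered(am, s, x1_bar, y_bar)
s2y = centered(pm, s, x2_bar, y_bar)
syy = centered(s, s, y_bar, y_bar)

# Solve  s11*b1 + s12*b2 = s1y  and  s12*b1 + s22*b2 = s2y  by Cramer's rule.
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = y_bar - b1 * x1_bar - b2 * x2_bar

ss_reg = b1 * s1y + b2 * s2y
r_sq = ss_reg / syy
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - k - 1)
f_stat = (ss_reg / k) / ((syy - ss_reg) / (n - k - 1))

print(f"S = {b0:.3f} + {b1:.4f} AM + {b2:.4f} PM")
print(f"R^2 = {r_sq:.4f}, adjusted R^2 = {r_sq_adj:.4f}, F = {f_stat:.2f}")
```

Because AM and PM are almost perfectly collinear, the determinant is small relative to its two terms, so rounding intermediate values by hand can noticeably change the coefficients.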
5. Data from problem 3 is repeated. (Use α = .01.)
A sales manager wishes to predict newspaper circulation on Sunday (y) on the basis of weekday morning
circulation (x1), weekday evening circulation (x2) and time (x3). All
circulation data is in millions sold.
The time variable is now added with the following results.
MTB > Regress c1 3 c2 c3 c10;
SUBC> VIF;
SUBC> DW.

Regression Analysis: S versus AM, PM, T

The regression equation is
S = 48.8 - 0.163 AM + 0.302 PM + 1.51 T

Predictor     Coef  SE Coef      T      P   VIF
Constant     48.79    22.22   2.20  0.080
AM         -0.1631   0.2621  -0.62  0.561  29.1
PM          0.3024   0.5228   0.58  0.588  60.2
T           1.5081   0.6578   2.29  0.070  5

S = 0.658831   R-Sq = 95.3%   R-Sq(adj) = 92.5%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       3  43.830  14.610  33.66  0.001
Residual Error   5   2.170   0.434
Total            8  46.000

Durbin-Watson statistic = 2.51601
a) What do the significance tests on the coefficients reveal? Give reasons. (2)
b) Can you explain why the coefficient of AM seems unreasonable? What is the apparent reason for this?
(2)
c) Do a 10% two-sided Durbin-Watson test on the result as suggested in class. What is the hypothesis tested
and what is the result? (3)
d) Reuse your spare parts from the previous regression if possible to compute the correlation between AM
and PM circulation and test it for significance. (4)
e) Compute a rank correlation between AM and PM circulation and test it for significance. Can you explain
why it is larger than the correlation in d)? (4)
f) Test the hypothesis that the correlation that you computed in d) is -.99. (4) [101]
g) (Extra credit) If AM, PM and T are x1, x2 and x3, find the partial correlation coefficient (square root
of the coefficient of partial determination) r_Y3.12. (2)
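Parts d) and f) both start from the simple correlation between AM and PM. The sketch below is an illustration under stated assumptions: it uses the usual t ratio for testing a zero correlation and the Fisher z-transformation for testing a nonzero value, which is the textbook approach but may not match the exact form given in the course outline. Variable names are made up for the sketch.

```python
import math

am = [32, 35, 36, 40, 41, 42, 43, 44, 47]   # x1: weekday morning circulation
pm = [34, 33, 31, 29, 29, 28, 26, 25, 24]   # x2: weekday evening circulation

n = len(am)
am_bar, pm_bar = sum(am) / n, sum(pm) / n

# Spare parts: corrected sums of squares and cross products.
s11 = sum(a * a for a in am) - n * am_bar ** 2
s22 = sum(b * b for b in pm) - n * pm_bar ** 2
s12 = sum(a * b for a, b in zip(am, pm)) - n * am_bar * pm_bar

r = s12 / math.sqrt(s11 * s22)

# Part d): t ratio for H0: rho = 0, with n - 2 degrees of freedom.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Part f): Fisher z-transformation for H0: rho = -0.99.
rho_0 = -0.99
z_r = math.atanh(r)            # 0.5 * ln((1 + r) / (1 - r))
z_0 = math.atanh(rho_0)
z_stat = (z_r - z_0) / (1 / math.sqrt(n - 3))

print(f"r = {r:.4f}, t = {t_stat:.2f}, Fisher z ratio = {z_stat:.3f}")
```

The t ratio is compared with a t-table value for 7 degrees of freedom and the Fisher ratio with a Normal table value at the significance level stated for the problem.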
6. The following times were recorded for 6 skiers on 3 slopes. In order to assess their difficulty we look at
the median time for each slope. We do not assume a Normal distribution. Do not compute the median or
mean time for any slope.
Skier   Slope 1   Slope 2   Slope 3
  1       4.7       5.6       4.9
  2       4.4       5.6       4.9
  3       4.0       5.0       4.7
  4       4.3       4.3       4.9
  5       4.4       4.5       4.3
  6       3.2       3.4       3.7
a) Test the hypothesis that the median time on slope 2 is 5 minutes (3 or 2 depending on method) (3)
b) Test the hypothesis that slope 2 and slope 3 have the same median times. (4)
c) Test the hypothesis that the slopes all have the same median time. (4)
d) Explain what methods you would use in b) and c) if the columns were independent random samples. (1)
e) Rank the skiers’ times on each slope from 1 (fastest) to 6. Use these as rankings of the skiers and test to
see if the ranks agree between slopes. (4)
[117]
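For part c), one natural choice with blocked, non-Normal data is the Friedman test, which ranks the three slopes within each skier. The sketch below is a hedged illustration of that ranking and the resulting statistic. It assumes average ranks for ties and no tie correction, and the helper `average_ranks` is hypothetical.

```python
# Rows are skiers 1-6; columns are slopes 1-3 (times from the table above).
times = [
    [4.7, 5.6, 4.9],
    [4.4, 5.6, 4.9],
    [4.0, 5.0, 4.7],
    [4.3, 4.3, 4.9],
    [4.4, 4.5, 4.3],
    [3.2, 3.4, 3.7],
]

def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

row_ranks = [average_ranks(row) for row in times]
rank_sums = [sum(col) for col in zip(*row_ranks)]

r, c = len(times), len(times[0])       # blocks (skiers) and treatments (slopes)
friedman = 12 / (r * c * (c + 1)) * sum(s ** 2 for s in rank_sums) - 3 * r * (c + 1)

print("rank sums by slope:", rank_sums)
print(f"Friedman statistic = {friedman:.4f}")
```

With only 6 blocks and 3 columns, the statistic is usually compared with a small-sample Friedman table, or approximately with chi-square for c - 1 = 2 degrees of freedom.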
7. Clarence Sales is a marketing major and knows that national soft drink market shares are as below.
Classic Coke    15.6%
Pepsi           13.2%
Diet Coke        5.1%
Diet Pepsi       3.5%
Other brands    62.6%
He gets in a bit of trouble here and is sentenced to 20 hours of public service. After he finishes his public
service he takes off for Maine, gets caught littering and is sentenced to another 20 hours of public service.
During his public service, he picks up 100 cans in each state. The cans are as below.
Brand           PA    ME
Classic Coke    21    16
Pepsi           15    11
Diet Coke       13    10
Diet Pepsi       6     5
Other brands    45    58
Use a 1% significance level throughout this problem. Don’t waste our time by just computing percents and
saying that they are different. Each problem requires a statistical test or the equivalent. State your null and
alternative hypotheses in each problem.
a) Regard the cans picked up as a random sample of sales in the two states. Can we say that the proportions
of soft drink cans discarded in Maine are the same as the national market shares? (5)
b) Clarence knows that Maine is Moxie country, so he believes that the proportion of other brands sold
is higher in Maine than in Pennsylvania. Is this true? (4)
c) Create a 2% 2-sided confidence interval for the difference between the proportions of other brands sold
in Maine. Using your Normal table, make this into a 2.5% 2-sided interval. (3)
d) Actually Clarence’s mother owns the Pepsi franchise for Maine, and last year her sales of Pepsi
and Diet Pepsi together accounted for 15% of the soft drink market in Maine. She tells Clarence that her sales are
now above 15%. On the basis of Clarence’s Maine sample is that true? (2)
[131]
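Part a) is a chi-squared goodness-of-fit comparison of the Maine counts with the national shares. The sketch below is an illustration under stated assumptions: the critical value is the standard chi-square table entry for a 1% upper tail with 4 degrees of freedom, and the variable names are made up for the sketch. Note that the expected count for Diet Pepsi is 3.5, which is below the usual rule-of-thumb minimum of 5, so a careful answer might merge categories.

```python
brands = ["Classic Coke", "Pepsi", "Diet Coke", "Diet Pepsi", "Other brands"]
national_share = [0.156, 0.132, 0.051, 0.035, 0.626]   # H0 proportions
maine_cans = [16, 11, 10, 5, 58]                        # observed counts

n = sum(maine_cans)
expected = [p * n for p in national_share]

# Pearson chi-squared statistic: sum of (O - E)^2 / E over categories.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(maine_cans, expected))
df = len(brands) - 1

CRITICAL_1PCT_4DF = 13.2767   # chi-square table, upper 1% point, 4 df
reject_h0 = chi_sq > CRITICAL_1PCT_4DF

for brand, o, e in zip(brands, maine_cans, expected):
    print(f"{brand:13s} O = {o:3d}  E = {e:5.1f}")
print(f"chi-squared = {chi_sq:.4f} with {df} df, reject H0: {reject_h0}")
```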