4/25/03 252x0341 ECO252 QBA2 Name

FINAL EXAM
May 6, 2003
Name
Hour of Class Registered (Circle)
I. (18 points) Do all the following. Note that answers without reasons receive no credit.
A researcher wishes to use demographic information to predict sales of a large chain of nationwide sports
stores. The researcher assembles the following data for a random sample of 38 stores. Use α = .10 in this
problem.
Row    Sales      Age   Growth   Income       HS  College
  1  1695713  33.1574   0.8299  26748.5  73.5949  17.8350
  2  3403862  32.6667   0.6619  53063.8  88.4557  31.9439
  3  2710353  35.6553   0.9688  36090.1  73.5362  18.6198
  4   529215  33.0728   0.0821  32058.1  79.1780  20.6284
  5   663687  35.7585   0.4646  47843.4  84.1838  35.2032
  6  2546324  33.8132   2.1796  50181.0  93.4996  41.7057
  7  2787046  30.9797   1.8048  30710.1  78.0234  28.0250
  8   612696  30.7843  -0.0569  29141.7  70.2949  15.0882
  9   891822  32.3164  -0.1577  25980.2  70.6674  10.9829
 10  1124968  32.5312   0.3664  18730.9  63.7395  13.2458
 11   909501  31.4400   2.2256  31109.2  76.9059  19.5500
 12  2631167  33.1613   1.5158  35614.1  82.9452  20.8135
 13   882973  31.8736   0.1413  23038.4  65.2127  16.9796
 14  1078573  33.4072  -1.0400  34531.7  73.4944  32.9920
 15   844320  34.0470   1.6836  30350.4  80.2201  22.3185
 16  1849119  28.8879   2.3596  38964.9  87.5973  24.5670
 17  3860007  36.1056   0.7840  49392.8  85.3041  30.8790
 18   826574  32.8083   0.1164  25595.7  65.5884  17.4545
 19   604683  33.0538   1.1498  29622.6  80.6176  18.6356
 20  1903612  33.4996   0.0606  31586.1  80.3790  38.3249
 21  2356808  32.6809   1.6338  39674.6  79.8526  23.7780
 22  2788572  28.5166   1.1256  28879.0  81.2371  16.9300
 23   634878  32.8945   1.4884  24287.1  70.2244  19.1429
 24  2371627  30.5024   4.7937  46711.2  87.1046  30.8843
 25  2627838  30.2922   1.8922  33449.8  80.2057  26.5570
 26  1868116  31.2911   1.8667  31694.5  75.2914  28.3600
 27  2236797  33.0498   1.7896  25459.2  77.6162  19.2490
 28  1318876  32.9348   0.2707  47047.3  85.1753  35.4994
 29  1868098  31.8381   3.0129  26433.2  74.1792  18.6375
 30  1695219  31.0794  23.4630  33396.7  81.6991  41.1130
 31  2700194  32.1807   0.7041  26179.4  73.4140  17.8566
 32  1156050  31.6944  -0.1569  33454.6  73.7161  26.5426
 33   643858  34.0263   0.7084  42271.5  78.6493  29.8734
 34  2188687  34.7315   0.1353  46514.8  80.9503  24.5374
 35   830352  30.5613   0.3848  27030.8  66.8057  14.1390
 36  1226906  33.5183   0.7417  42910.1  77.8905  20.8340
 37   566904  32.3952   0.6693  40561.4  79.3622  19.0309
 38   826518  29.9108   0.1111  22326.0  58.3610  10.6729
In the data above ‘Sales’ is the total sales in the last month, ‘Age’ is the median customer age,
‘Growth’ is the population growth rate in the last ten years, ‘Income’ is median family income,
‘HS’ is percent of potential customers with a high school diploma, ‘College’ is percent of potential
customers with a college degree.
To start with, the researcher runs ‘sales’ against each independent variable individually with the following
results.
MTB > regress c1 1 c2
Regression Analysis: Sales versus Age
The regression equation is
Sales = 931626 + 21783 Age
Predictor     Coef  SE Coef     T      P
Constant    931626  2851421  0.33  0.746
Age          21783    87750  0.25  0.805

S = 919493   R-Sq = 0.2%   R-Sq(adj) = 0.0%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       1  52099324721  52099324721  0.06  0.805
Residual Error  36  3.04368E+13  8.45467E+11
Total           37  3.04889E+13

Unusual Observations
Obs   Age    Sales      Fit  SE Fit  Residual  St Resid
 17  36.1  3860007  1718106  353724   2141901     2.52R
 22  28.5  2788572  1552797  376045   1235775     1.47 X
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
MTB > regress c1 1 c3
Regression Analysis: Sales versus Growth
The regression equation is
Sales = 1595571 + 26834 Growth
Predictor     Coef  SE Coef     T      P
Constant   1595571   161301  9.89  0.000
Growth       26834    39601  0.68  0.502

S = 914467   R-Sq = 1.3%   R-Sq(adj) = 0.0%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       1  3.83946E+11  3.83946E+11  0.46  0.502
Residual Error  36  3.01050E+13  8.36249E+11
Total           37  3.04889E+13

Unusual Observations
Obs  Growth    Sales      Fit  SE Fit  Residual  St Resid
 17     0.8  3860007  1616609  151819   2243398     2.49R
 30    23.5  1695219  2225167  878449   -529948    -2.09RX
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
MTB > regress c1 1 c4
Regression Analysis: Sales versus Income
The regression equation is
Sales = 299877 + 39.2 Income
Predictor     Coef  SE Coef     T      P
Constant    299877   554447  0.54  0.592
Income       39.17    15.71  2.49  0.017

S = 849860   R-Sq = 14.7%   R-Sq(adj) = 12.3%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       1  4.48747E+12  4.48747E+12  6.21  0.017
Residual Error  36  2.60014E+13  7.22262E+11
Total           37  3.04889E+13

Unusual Observations
Obs  Income    Sales      Fit  SE Fit  Residual  St Resid
 17   49393  3860007  2234579  276038   1625428     2.02R
R denotes an observation with a large standardized residual
MTB > regress c1 1 c5
Regression Analysis: Sales versus HS
The regression equation is
Sales = - 2969741 + 59660 HS
Predictor      Coef  SE Coef      T      P
Constant   -2969741  1370956  -2.17  0.037
HS            59660    17669   3.38  0.002

S = 802004   R-Sq = 24.1%   R-Sq(adj) = 21.9%

Analysis of Variance
Source          DF           SS           MS      F      P
Regression       1  7.33335E+12  7.33335E+12  11.40  0.002
Residual Error  36  2.31556E+13  6.43210E+11
Total           37  3.04889E+13

Unusual Observations
Obs    HS    Sales      Fit  SE Fit  Residual  St Resid
 17  85.3  3860007  2119509  192928   1740498     2.24R
 38  58.4   826518   512081  358068    314437     0.44 X
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
MTB > regress c1 1 c6
Regression Analysis: Sales versus College
The regression equation is
Sales = 789847 + 35854 College
Predictor     Coef  SE Coef     T      P
Constant    789847   439508  1.80  0.081
College      35854    17582  2.04  0.049

S = 871330   R-Sq = 10.4%   R-Sq(adj) = 7.9%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       1  3.15714E+12  3.15714E+12  4.16  0.049
Residual Error  36  2.73318E+13  7.59216E+11
Total           37  3.04889E+13

Unusual Observations
Obs  College    Sales      Fit  SE Fit  Residual  St Resid
  6     41.7  2546324  2285170  347197    261154     0.33 X
 17     30.9  3860007  1896988  189865   1963020     2.31R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
1. On the basis of the material above the researcher decides that ‘HS’ is the best single predictor of sales.
Please explain why. Consider the values of R² and the significance tests on the slope of the equation.
According to the equation showing the response of sales to HS, how much will sales rise if there is a 1%
increase in people with a high school diploma? The average store has sales of $1638487. Relative to this,
what percent increase in sales would be caused by an increase of 1 (percent) in ‘HS’? (4)
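The arithmetic behind the last part of question 1 can be sketched as follows. This is only an illustration of the calculation, using the HS slope from the Minitab output and the average sales figure quoted in the question; the function name `percent_change` is a hypothetical helper, not part of the exam.

```python
# Figures quoted in the exam: the slope in "Sales = - 2969741 + 59660 HS"
# and the stated average store sales.
hs_slope = 59660.0        # change in sales per 1-point rise in HS
mean_sales = 1638487.0    # average monthly sales per store


def percent_change(delta, base):
    """Return delta expressed as a percentage of base (hypothetical helper)."""
    return 100.0 * delta / base


pct = percent_change(hs_slope, mean_sales)
print(f"A 1-point rise in HS adds about {hs_slope:,.0f} to sales, "
      f"or {pct:.2f}% of average sales.")
```

The same helper could be reused with the HS coefficient from any of the later regressions to compare the implied percentage effects.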
2. The researcher tries to improve the prediction by adding another variable. Since there are four
variables other than ‘HS,’ there are four regressions below. Do any of them represent an improvement on ‘HS’
alone? Why? Look at the significance tests on the coefficients of the new variables and the adjusted R². In
order to put this in perspective, the average values of the independent variables are shown below.
   Age  Growth  Income  College     HS
32.450   1.599   34175    23.67  77.24
Take the best of the four regressions below and give the value of sales that would be predicted for a store
with average values of the independent variables, and explain by what percent sales would rise if ‘HS’ went
up by 1. How much does this differ from the prediction using ‘HS’ alone? (4)
MTB > regress c1 2 c5 c2
Regression Analysis: Sales versus HS, Age
The regression equation is
Sales = - 2126081 + 60953 HS - 29076 Age
Predictor      Coef  SE Coef      T      P
Constant   -2126081  2678378  -0.79  0.433
HS            60953    18226   3.34  0.002
Age          -29076    78952  -0.37  0.715

S = 811809   R-Sq = 24.3%   R-Sq(adj) = 20.0%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       2  7.42273E+12  3.71137E+12  5.63  0.008
Residual Error  35  2.30662E+13  6.59034E+11
Total           37  3.04889E+13

Source  DF       Seq SS
HS       1  7.33335E+12
Age      1  89382656452

Unusual Observations
Obs    HS    Sales      Fit  SE Fit  Residual  St Resid
 17  85.3  3860007  2023658  325389   1836350     2.47R
R denotes an observation with a large standardized residual
MTB > regress c1 2 c5 c3
Regression Analysis: Sales versus HS, Growth
The regression equation is
Sales = - 2959336 + 59494 HS + 1506 Growth
Predictor      Coef  SE Coef      T      P
Constant   -2959336  1412551  -2.10  0.043
HS            59494    18355   3.24  0.003
Growth         1506    36079   0.04  0.967

S = 813360   R-Sq = 24.1%   R-Sq(adj) = 19.7%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       2  7.33450E+12  3.66725E+12  5.54  0.008
Residual Error  35  2.31544E+13  6.61555E+11
Total           37  3.04889E+13

Source  DF       Seq SS
HS       1  7.33335E+12
Growth   1   1152089260

Unusual Observations
Obs    HS    Sales      Fit  SE Fit  Residual  St Resid
 17  85.3  3860007  2116944  205088   1743063     2.21R
 30  81.7  1695219  1936614  786380   -241395    -1.16 X
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
MTB > regress c1 2 c5 c4
Regression Analysis: Sales versus HS, Income
The regression equation is
Sales = - 3089379 + 62540 HS - 3.0 Income
Predictor      Coef  SE Coef      T      P
Constant   -3089379  1715239  -1.80  0.080
HS            62540    30098   2.08  0.045
Income        -3.01    25.26  -0.12  0.906

S = 813216   R-Sq = 24.1%   R-Sq(adj) = 19.7%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       2  7.34273E+12  3.67136E+12  5.55  0.008
Residual Error  35  2.31462E+13  6.61320E+11
Total           37  3.04889E+13

Source  DF       Seq SS
HS       1  7.33335E+12
Income   1   9375516635

Unusual Observations
Obs    HS    Sales      Fit  SE Fit  Residual  St Resid
 17  85.3  3860007  2096954  272313   1763053     2.30R
R denotes an observation with a large standardized residual
MTB > regress c1 2 c5 c6
Regression Analysis: Sales versus HS, College
The regression equation is
Sales = - 3193739 + 64448 HS - 6161 College
Predictor      Coef  SE Coef      T      P
Constant   -3193739  1627759  -1.96  0.058
HS            64448    25486   2.53  0.016
College       -6161    23343  -0.26  0.793

S = 812572   R-Sq = 24.2%   R-Sq(adj) = 19.9%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       2  7.37935E+12  3.68967E+12  5.59  0.008
Residual Error  35  2.31096E+13  6.60273E+11
Total           37  3.04889E+13

Source   DF       Seq SS
HS        1  7.33335E+12
College   1  45998746314

Unusual Observations
Obs    HS    Sales      Fit  SE Fit  Residual  St Resid
 17  85.3  3860007  2113692  196709   1746316     2.22R
R denotes an observation with a large standardized residual
3. In desperation the researcher tries to add all the variables at once.
a. What does the ANOVA show? (2)
b. Do any of the coefficients have a wrong sign? (Remember there is nothing wrong with a
negative coefficient unless you can give a reason why it shouldn’t be negative) (1)
c. Which of the coefficients are significant? (2)
d. Do an F test to show if addition of all the variables improved the regression. To do this drop a
few zeros. Take the regression sum of squares in the regression with ‘HS’ alone as 7.333, the
regression sum of squares after adding all the new variables as 7.454 and the error sum of squares
as 23.035. I’m getting this from the ANOVA table below and the sequential SS table below it by
dividing all the SS’s by 10¹², since only their relative size matters. (3)
e. To put the results in perspective try again to predict the sales that a store with the mean values
of the independent variables would have and what percent increase in sales would come from an
increase of 1 in ‘HS.’ How does this compare with our prediction when we used ‘HS’ alone?
f. The column marked VIF (variance inflation factor) is a test for (multi)collinearity. The rule of
thumb is that if any of these exceeds 5, we have a multicollinearity problem. None does. What is
multicollinearity and why am I worried about it? (2)
MTB > regress c1 5 c5 c2 c6 c4 c3;
SUBC> vif.
Regression Analysis: Sales versus HS, Age, College, Income, Growth
The regression equation is
Sales = - 2270706 + 62735 HS - 27384 Age - 5702 College + 2.4 Income + 2084 Growth
Predictor      Coef  SE Coef      T      P  VIF
Constant   -2270706  3696533  -0.61  0.543
HS            62735    35090   1.79  0.083  3.5
Age          -27384    93046  -0.29  0.770  1.3
College       -5702    28359  -0.20  0.842  2.7
Income         2.45    30.53   0.08  0.937  3.8
Growth         2084    44098   0.05  0.963  1.4

S = 848433   R-Sq = 24.4%   R-Sq(adj) = 12.6%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       5  7.45407E+12  1.49081E+12  2.07  0.095
Residual Error  32  2.30348E+13  7.19839E+11
Total           37  3.04889E+13

Source   DF       Seq SS
HS        1  7.33335E+12
Age       1  89382656452
College   1  26200610077
Income    1   3524785887
Growth    1   1608397623

Unusual Observations
Obs    HS    Sales      Fit  SE Fit  Residual  St Resid
 17  85.3  3860007  2038662  360453   1821346     2.37R
 30  81.7  1695219  1899886  826437   -204667    -1.07 X
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
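The F test described in part 3d can be sketched as below. This is a minimal illustration, assuming the usual partial F statistic (extra regression sum of squares per added variable, divided by the residual mean square of the larger model); the function name `partial_f` is hypothetical, and the inputs are read directly from the ANOVA and sequential SS tables above.

```python
def partial_f(ssr_full, ssr_reduced, sse_full, extra_vars, df_error):
    """F statistic for adding `extra_vars` predictors to a reduced model."""
    numerator = (ssr_full - ssr_reduced) / extra_vars
    denominator = sse_full / df_error
    return numerator / denominator


f_stat = partial_f(
    ssr_full=7.45407e12,     # regression SS with all five predictors
    ssr_reduced=7.33335e12,  # sequential SS for HS alone
    sse_full=2.30348e13,     # residual SS with all five predictors
    extra_vars=4,            # Age, College, Income, Growth
    df_error=32,             # residual degrees of freedom
)
print(f"F(4, 32) = {f_stat:.3f}")
```

The statistic would then be compared with the tabled F value for 4 and 32 degrees of freedom at the chosen significance level. Because only ratios matter, dividing every sum of squares by the same constant leaves the result unchanged.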
II. Do at least 4 of the following 7 problems (at least 15 points each), or do sections adding to at least 60
points. Anything extra you do helps, and grades wrap around. Show your work! State H₀ and H₁ where
applicable. Use a significance level of 5% unless noted otherwise. Do not answer questions without citing
appropriate statistical tests.
1. (Berenson et al. 1220) A firm believes that less than 15% of people remember their ads. A survey is
taken to see what recall occurs, with the following results. (In these problems calculating proportions won’t
help you unless you do a statistical test.)
Medium      Mag   TV  Radio  Total
Remembered   25   10      7     42
Forgot       73   93    108    274
Total        98  103    115    316
a. Test the hypothesis that the recall rate is less than 15% by using proportions calculated from the
‘Total’ column. Find a p-value for this result. (5)
b. Test the hypothesis that the proportion recalling was lower for Radio than TV. (4)
c. Test to see if there is a significant difference in the proportion that remembered according to the
medium. (6)
d. The Marascuilo procedure says that if (i) equality is rejected in c) and
(ii) |p₂ − p₃| > √(χ²) · s_Δp, where the chi-squared is what you used in c) and the standard deviation is
what you would use in a confidence interval solution to b), you can say that you have a significant
difference between TV and Radio. Try it! (5)
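Condition (ii) of the Marascuilo procedure can be sketched as follows. This is an illustration under stated assumptions: the function name `marascuilo_range` is hypothetical, the standard deviation is the unpooled one used in a confidence interval for a difference of two proportions, and 5.9915 is the tabled 5% chi-squared value for 2 degrees of freedom, which should be replaced by whatever value was used in part c).

```python
import math


def marascuilo_range(x_a, n_a, x_b, n_b, chi2_crit):
    """Return (|p_a - p_b|, critical range) for one pair of proportions."""
    p_a, p_b = x_a / n_a, x_b / n_b
    # Unpooled standard deviation of the difference between two proportions.
    s_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return abs(p_a - p_b), math.sqrt(chi2_crit) * s_diff


# TV: 10 of 103 remembered; Radio: 7 of 115 remembered.
# Assumption: chi-squared critical value for alpha = .05 and 2 df.
diff, crit = marascuilo_range(10, 103, 7, 115, chi2_crit=5.9915)
print(f"|p2 - p3| = {diff:.4f}, critical range = {crit:.4f}")
```

The pair is declared significantly different only if the first number exceeds the second and equality was already rejected in part c).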
2. (Berenson et al. 1142) A manager is inspecting a new type of battery. These are subjected to 4 different
pressure levels and their time to failure is recorded. The manager knows from experience that such data are
not normally distributed. Ranks are provided.
                      PRESSURE
Use    low  rank  normal  rank  high  rank  whee!  rank
  1    8.0    11     7.6     8   6.0     4    5.1     1
  2    8.1    12     8.2    13   6.3     5    5.6     2
  3    9.2    15     9.8    17   7.1     7    5.9     3
  4    9.4    16    10.9    18   7.7     9    6.7     6
  5   11.7    19    12.3    20   8.9    14    7.8    10
a. At the 5% level analyze the data on the assumption that each column represents a random
sample. Do the column medians differ? (5)
b. Rerank the data appropriately and repeat a) on the assumption that the data is non-normal but
cross classified by use. (5)
c. This time I want to compare high pressure (H) against low - moderate pressure (L). I will write
out the numbers 1-20 and label them according to pressure.
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 H  H  H  H  H  H  H  L  H  H  L  L  L  H  L  L  L  L  L  L
Do a runs test to see if the H’s and L’s appear randomly. This is called a Wald-Wolfowitz test for the
equality of means in two nonnormal samples. Null hypothesis is that the sequence is random and the means
are equal. What is your conclusion? (5)
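Counting the runs in part c) can be sketched as below. This is an illustration only: the function name `runs_test` is hypothetical, and the z statistic uses the large-sample normal approximation for the number of runs, which is an assumption here since with samples this small the course may expect a runs table instead.

```python
import math


def runs_test(sequence):
    """Return (runs, n1, n2, z) for a sequence made of two symbols."""
    symbols = sorted(set(sequence))
    assert len(symbols) == 2, "expected exactly two distinct symbols"
    n1 = sequence.count(symbols[0])
    n2 = sequence.count(symbols[1])
    n = n1 + n2
    # A new run starts wherever a symbol differs from its predecessor.
    runs = 1 + sum(a != b for a, b in zip(sequence, sequence[1:]))
    mean = 2 * n1 * n2 / n + 1
    variance = (mean - 1) * (mean - 2) / (n - 1)
    z = (runs - mean) / math.sqrt(variance)
    return runs, n1, n2, z


labels = "HHHHHHHLHHLLLHLLLLLL"  # ranks 1-20 as labelled in the exam
runs, n_h, n_l, z = runs_test(labels)
print(f"runs = {runs}, n_H = {n_h}, n_L = {n_l}, z = {z:.2f}")
```

Too few runs, as measured against the table or the normal approximation, would suggest that the H's and L's are clustered rather than randomly mixed.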
3. A researcher studies the relationship of numbers of subsidiaries and numbers of parent companies in 11
metropolitan areas and finds the following:
Area  parents (x)  subsidiaries (y)      x²       xy        y²
   1          653              2607  426409  1702371   6796449
   2          391              1714  152881   670174   2937796
   3          352              1857  123904   653664   3448449
   4          261              1228   68121   320508   1507984
   5          226               880   51076   198880    774400
   6          218               671   47524   146278    450241
   7          202              1524   40804   307848   2322576
   8          151               889   22801   134239    790321
   9          141               482   19881    67962    232324
  10          138               569   19044    78522    323761
  11          134               662   17956    88708    438244
 Sum         2867             13083  990401  4369154  20022545
a. Compute Spearman’s rank correlation between x and y and test it for significance. (6)
b. Compute the sample correlation between x and y and test it for significance. (6)
c. Compute the sample standard deviation of x and test to see if it equals 200. (4)
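The sample correlation in part b) can be computed from the column sums alone, as in the sketch below. It assumes the usual computational formulas for the corrected sums of squares and the t statistic for testing a zero correlation; the variable names are illustrative, not taken from the exam.

```python
import math

# Column sums from the table above.
n = 11
sum_x, sum_y = 2867, 13083
sum_x2, sum_xy, sum_y2 = 990401, 4369154, 20022545

# Corrected sums of squares and cross-products.
ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

r = ss_xy / math.sqrt(ss_x * ss_y)
# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(f"r = {r:.4f}, t = {t_stat:.2f} with {n - 2} df")
```

The t statistic would then be compared with the tabled two-sided t value for 9 degrees of freedom.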
4. Data from the previous page is repeated:
Area  parents (x)  subsidiaries (y)      x²       xy        y²
   1          653              2607  426409  1702371   6796449
   2          391              1714  152881   670174   2937796
   3          352              1857  123904   653664   3448449
   4          261              1228   68121   320508   1507984
   5          226               880   51076   198880    774400
   6          218               671   47524   146278    450241
   7          202              1524   40804   307848   2322576
   8          151               889   22801   134239    790321
   9          141               482   19881    67962    232324
  10          138               569   19044    78522    323761
  11          134               662   17956    88708    438244
 Sum         2867             13083  990401  4369154  20022545
a. Test the hypothesis that the correlation between x and y is .8 (5)
b. Test the hypothesis that x has the Normal distribution. (9)
c. Test the hypothesis that x and y have equal variances. (4)
5. Data from the previous page is repeated:
Area  parents (x)  subsidiaries (y)      x²       xy        y²
   1          653              2607  426409  1702371   6796449
   2          391              1714  152881   670174   2937796
   3          352              1857  123904   653664   3448449
   4          261              1228   68121   320508   1507984
   5          226               880   51076   198880    774400
   6          218               671   47524   146278    450241
   7          202              1524   40804   307848   2322576
   8          151               889   22801   134239    790321
   9          141               482   19881    67962    232324
  10          138               569   19044    78522    323761
  11          134               662   17956    88708    438244
 Sum         2867             13083  990401  4369154  20022545
a. Compute a simple regression of subsidiaries against parents as the independent variable. (5)
b. Compute sₑ. (3)
c. Predict how many subsidiaries will appear in a city with 50 parent corporations. (1)
d. Make your prediction in c) into a confidence interval. (3)
e. Compute s_b₀ and make it into a confidence interval for β₀. (3)
f. Do an ANOVA for this regression and explain what it says about β₁. (3)
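Parts a) and c) can be sketched from the column sums as follows. This is a minimal illustration assuming the standard least-squares formulas for a simple regression; `predict` is a hypothetical helper name, and the later parts (standard errors and intervals) are not covered.

```python
# Column sums from the table above.
n = 11
sum_x, sum_y = 2867, 13083
sum_x2, sum_xy = 990401, 4369154

x_bar, y_bar = sum_x / n, sum_y / n

# Least-squares slope and intercept for y = b0 + b1 * x.
b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar


def predict(parents):
    """Predicted number of subsidiaries for a given number of parents."""
    return b0 + b1 * parents


print(f"subsidiaries = {b0:.2f} + {b1:.4f} * parents")
print(f"prediction at 50 parents: {predict(50):.1f}")
```

Note that 50 parents lies below every observed x value, so the prediction is an extrapolation and its interval in part d) will be correspondingly wide.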
6. A chain has the following data on prices, promotion expenses and sales of one product (You can do
Σx₁x₂):
Store  sales (y)  price (x₁)  promotion (x₂)    x₁²      x₂²         y²      x₁y       x₂y
    1       4141          59             200   3481    40000   17147881   244319    828200
    2       3754          59             400   3481   160000   14092516   221486   1501600
    3       5000          59             600   3481   360000   25000000   295000   3000000
    4       4011          59             600   3481   360000   16088121   236649   2406600
    5       3224          79             200   6241    40000   10394176   254696    644800
    6       2618          79             400   6241   160000    6853924   206822   1047200
    7       3746          79             600   6241   360000   14032516   295934   2247600
    8       3825          79             600   6241   360000   14630625   302175   2295000
    9       1096          99             200   9801    40000    1201216   108504    219200
   10       1882          99             400   9801   160000    3541924   186318    752800
   11       2159          99             400   9801   160000    4661281   213741    863600
   12       2927          99             600   9801   360000    8567329   289773   1756200
  Sum      38383         948            5200  78092  2560000  136211509  2855417  17562800
ȳ = 3198.58, x̄₁ = 79.0000 and x̄₂ = 433.333.
a. Do a multiple regression of sales against x₁ and x₂. (10)
b. Compute R² and R² adjusted for degrees of freedom. Use a regression ANOVA to test the usefulness
of this regression. (6)
d. Use your regression to predict sales when price is 79 cents and promotion expenses are $200. (2)
e. Use the directions in the outline to make this estimate into a confidence interval and a prediction interval.
(4)
f. If the regression of Price alone had the following output: The regression equation is
sales = 7564 - 55.3 price
Predictor    Coef  SE Coef      T      P
Constant   7564.3    863.6   8.76  0.000
price      -55.26    10.71  -5.16  0.000

S = 605.6   R-Sq = 72.7%   R-Sq(adj) = 70.0%

Analysis of Variance
Source          DF        SS       MS      F      P
Regression       1   9772621  9772621  26.65  0.000
Residual Error  10   3667664   366766
Total           11  13440285
Do an F-test to see if adding x₂ helped. (4) The next page is blank – please show your work.
7. The Lees present the following data on college students’ summer wages vs. years of work experience,
blocked by location.
        Years of Work Experience
Region     1     2     3
     1    16    19    24
     2    21    20    21
     3    18    21    22
     4    14    21    26
a. Do a 2-way ANOVA on these data and explain what hypotheses you test and what the
conclusions are. (9) (Or do a 1-way ANOVA for 6 points.) The following column sums are done for you:
Σx₁ = 69, Σx₂ = 81, n₁ = 4, n₂ = 4, Σx₁² = 1217 and Σx₂² = 1643. So x̄₁ = 17.25 and x̄₂ = 20.25.
b. Do a test of the equality of the means in columns 1 and 3 assuming that the columns are random
samples from Normal populations with equal variances (4).
c. Assume that columns 1 and 3 do not come from a Normal distribution and are not paired data
and do a test for equal medians. (4)
d. Test the following data for uniformity. n = 20. (6)
Category   1   2   3   4   5
Numbers    0   2   0  10   8