Chapter 9: Multiple Regression
1.
a.
5.117
b.
4.256
c.
3.863
d.
3.633
e.
3.481
2.
a.
0.094
b.
0.075
c.
0.063
d.
0.055
e.
0.049
3.
a.
Yes. The equation is linear with respect to the error term.
b.
No. The equation is not linear with respect to the error term.
c.
Yes. The equation is linear with respect to the error term.
4.
Collinearity is the state in which two or more of the predictor variables are highly correlated with each
other. This means that for the purposes of multiple regression, those variables are conveying much of the
same information.
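As an illustration of the definition, here is a minimal Python sketch (with made-up data) that diagnoses collinearity by computing each predictor's variance inflation factor (VIF); a VIF much larger than 1 flags a predictor that is well explained by the others.

```python
# Minimal sketch with made-up data: flag collinear predictors via the
# variance inflation factor, VIF_j = 1 / (1 - R^2_j), where R^2_j comes
# from regressing predictor j on the remaining predictors.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.1, size=50)   # nearly collinear with x1
x3 = rng.normal(size=50)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2)

for j in range(X.shape[1]):
    print(f"VIF of predictor {j + 1}: {vif(X, j):.1f}")  # x1 and x2 are large
```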
5.
a.
The correlation matrix and scatterplot matrix for the variables appear as follows:
b.
In Repairs vs. Support, there is an outlier with Support of 7.2 and Repairs of 8.8, belonging to
Texas Instruments. In Buy Again vs. Reliability, Texas Instruments is also an outlier, with Buy
Again of 4.5 and Reliability of 8.8. And finally, Texas Instruments is an outlier on Buy Again vs.
Repairs, at 4.5 Buy Again and 8.8 Repairs.
With respect to the willingness to buy again, the high scores belong to mail order companies like
Gateway 2000 and Dell. Big companies like Texas Instruments, AT&T, and IBM score rather low
on willingness to buy again, although Apple, Compaq, Digital Equipment, and Hewlett-Packard
have scores near the middle.
c.
Using the Regression command from the Analysis ToolPak, the resulting regression line is
BuyAgain = 1.53(Reliability) – 1.14(Repairs) + 1.50(Support) – 8.76. Surprisingly, higher repair
satisfaction is negatively associated with buying again after including the other factors. The
regression is fairly successful, with an R Squared value of .72 (.69 adjusted). All three predictors’
coefficients are significant at the 5% level.
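For readers working outside Excel, here is a minimal sketch of the same kind of least-squares fit using numpy; the data values below are made-up stand-ins, not the actual survey worksheet.

```python
# Minimal sketch with made-up stand-in data: the fit
# BuyAgain ~ Reliability + Repairs + Support, computed with numpy.
import numpy as np

reliability = np.array([8.1, 7.5, 8.8, 6.9, 7.2, 8.4])
repairs     = np.array([7.9, 7.0, 8.8, 6.5, 7.1, 8.0])
support     = np.array([7.5, 6.8, 7.2, 6.0, 6.9, 7.8])
buy_again   = np.array([8.0, 6.9, 4.5, 5.8, 6.6, 8.3])

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones_like(buy_again), reliability, repairs, support])
beta, *_ = np.linalg.lstsq(X, buy_again, rcond=None)
print("intercept, Reliability, Repairs, Support:", np.round(beta, 2))

# R-squared: 1 - RSS / TSS
pred = X @ beta
rss = np.sum((buy_again - pred) ** 2)
tss = np.sum((buy_again - buy_again.mean()) ** 2)
print("R^2:", round(1 - rss / tss, 2))
```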
Here are the residual plots:
The plot of residuals vs. predicted values does not show any special pattern. There is no
indication of curvature or unequal variance, and there are no outliers. The normal plot is fairly
straight, with no extreme values, so there is no suggestion of problems with the normality
assumption. This might be surprising, considering that there are outliers in the scatterplot matrix.
Texas Instruments has the most negative residual, and this makes sense because it showed up on
the scatterplot matrix as having very low values of Buy Again relative to its reliability and repair
record.
Here are the residual plots of the three predictors:
The individual predictors’ residual plots do not indicate any model failure with respect to the three
predictor variables. To summarize, the four variables are strongly related, although there are
some outlying points. The large companies do not necessarily score well with PC customers, and
the highest scorers are mail order firms. The three other variables are quite successful in
predicting willingness to buy again. The diagnostic plots indicate no problems with the
regression assumptions.
6.
a.
The correlation and scatterplot matrices appear as follows:
b.
The regression statistics are:
The regression can be counted as a success in the sense that the multiple R2 value is quite high.
The value 0.971 indicates the regression accounts for 97.1% of the variability in Calories. Both
the Carbo and Protein coefficients are close to their known values (which are both 4).
However, the coefficient for Fat is 12.248, which is pretty far from the known value of 9.
Nevertheless, the 95% confidence interval for the Fat coefficient does include the known value
within its range.
c.
The coefficient for Fat has a high standard error and is inaccurate compared to the known value
because Fat amounts are given only to the nearest half, and the amounts themselves are small (all 2 or
less). This means that the percentage error can be very large, so it is reasonable to expect that the
Fat coefficient would be inaccurate, as the simulation below illustrates.
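A small simulation sketch of this point, with made-up amounts, where both the fat grams (nearest half) and the label calories (nearest ten) are rounded as on real packages:

```python
# Simulation sketch: rounding small Fat amounts to the nearest 0.5 gram
# (and label calories to the nearest 10) makes the fitted Fat coefficient
# much noisier than the Protein and Carbo coefficients.
import numpy as np

rng = np.random.default_rng(2)
n = 30
protein = rng.uniform(1, 6, n)
carbo = rng.uniform(10, 25, n)
fat = rng.uniform(0, 2, n)                        # small fat amounts, as in the data
calories = np.round((4 * protein + 4 * carbo + 9 * fat) / 10) * 10

fat_recorded = np.round(fat * 2) / 2              # rounded to the nearest half gram
X = np.column_stack([np.ones(n), protein, carbo, fat_recorded])
beta, *_ = np.linalg.lstsq(X, calories, rcond=None)
print(np.round(beta, 2))   # compare the fitted coefficients to the true 4, 4, 9
```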
d.
The residual vs. predicted value plot appears as follows:
Wheaties is an outlier, with a predicted calories per serving of 110, but actual calories per serving
of 100. Using the known values, we find that there should be at least 4(3) + 4(23) + 9(1) = 113
calories per serving, considerably higher than the advertised 100. Pretzels have the same amounts
of protein, carbohydrates, and fat, yet state 110 calories per serving. It is possible that the
company erred in their measurements, or that the error is due to rounding, or that the company
intentionally understated the total calories.
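The arithmetic above is easy to check in a few lines of Python, using the accepted 4/4/9 calories-per-gram values:

```python
# Quick check of the calorie arithmetic: 4 calories/gram for protein
# and carbohydrate, 9 calories/gram for fat.
def calories(protein_g, carbo_g, fat_g):
    return 4 * protein_g + 4 * carbo_g + 9 * fat_g

print(calories(3, 23, 1))   # Wheaties: 113, versus the advertised 100
```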
7.
a.
The coefficient for Fat in this regression is 9.74, much closer to the accepted value of 9. Also, the
standard error is 0.77, as opposed to 4.10 in the previous regression, so the inclusion of the high-fat
danish improved the regression considerably.
8.
a.
Using the Regression command from the Analysis ToolPak, the resulting regression model is
calculated to be:
Runs = –2.74 + 0.44(Singles) + 0.62(Doubles) + 2.10(Triples) + 1.47(Home runs) + 0.48(Walks)
The complete output from the regression command follows.
The Rosner-Woods coefficients are contained within the 95% confidence intervals for the
coefficients from this regression. In other words, the results here are consistent with the results
obtained earlier.
b.
Yes, the Rosner-Woods coefficients make more sense, since one would expect that the home runs
coefficient would be larger than the triples coefficient. Apparently there were lots of runners on
base when triples were hit that day, accounting for the large number of runs scored on triples!
c.
With more data, the coefficients would likely move closer to Rosner-Woods’.
9.
a.
The scatterplot matrix and correlation matrix appear as follows:
There is a strong negative correlation between Price and Age (–0.873) with a p-value of < 0.001.
There is also a strong negative correlation between Price and Miles (–0.702) with a p-value of
0.011.
b.
Here is the output from the Analysis ToolPak’s Regression command:
The regression equation is Price = 10246 – 721(Age) – 0.019(Miles); R Squared = .77 (.72 adjusted).
Examining the residuals, you note that there is a slight upward trend to the residuals vs. predicted values plot,
which may cause you to believe that the variation in price is not completely explained by the model. There is no
indication of non-normality in the normal probability plot of the residuals.
Examining the plots of the residuals vs. predictor variables leads you to conclude that the 12-year-old car has a
price that is an outlier in relation to the price/age comparison of the younger cars. Aside from the 12-year-old
car, there is a strong negative relationship between the residuals and age.
c.
The regression statistics for the reduced data set are:
Price = 11796 – 1276(Age) – 0.022(Miles); Multiple R Squared = .81 (.77 adjusted).
There is no indication of model failure in any of the residual plots.
There is a big change in the coefficient for Age between the two models. By omitting the oldest
car, the coefficient is nearly doubled, from a price decline of $721/year for the first model to a
decline of $1276/year for the second model. The second model is better, since the first model
showed serious flaws in the residual plots. The fact that there are no cars with ages between 6 and
12 also makes it unwise to try to model the price/age relationship during those years, since there is
no evidence that the relationship will continue to be linear. Finally, there is a slight increase in
the adjusted R2 value in going from the first model to the second.
d.
The Miles coefficient is not significant in either regression. None of the cars have high mileage
relative to their age. It seems that the mileage was not given in the advertisements for cars with
high mileage, so the mileage given is quite predictable from the age, and the mileage therefore
offers no new information beyond what is known from the age.
10.
a.
The scatterplot matrix and correlation matrix for the log(price) variable appear as follows:
The correlations are stronger than the corresponding correlations with Price: between Log Price
and Age, –0.970; between Log Price and Miles, –0.732.
b.
The output from the Analysis ToolPak Regression command is as follows:
Log Price = 9.42 – 0.169(Age) – 0.0000018(Miles)
The residual plots appear as follows:
There do not appear to be any major departures from the regression assumptions. The 12-year-old
car does appear separate from the other models in the Residuals vs. Predicted Values and
Residuals vs. Age plots, but the remaining observations do not appear to show any trend.
With Log Price, the R2 value is 0.94 (0.93 adjusted), an improvement over the 0.77 (0.72
adjusted) value for the Price model.
The correlation is stronger, the R Squared value is higher, and the old car does not have the high
residual, so the regression is much improved. Miles is even less significant in this regression.
c.
It is more sensible to use Log Price, since cars depreciate quickly when they are new, and more
slowly when they are older.
To see how the log model works, express Log Price in terms of Age and Miles and exponentiate the
equation Log Price = 9.42 – 0.169(Age) – 0.0000018(Miles) to get:
Price = e^(9.42 – 0.169(Age) – 0.0000018(Miles)) = e^9.42 × e^(–0.169(Age)) × e^(–0.0000018(Miles))
      = 12337 × 0.844^Age × 0.9999982^Miles
This estimates that a new car costs $12,337 and that the value is multiplied by 0.844 for each year of
age. In other words, the price drops by 15.6% each year (1 – 0.844 = 0.156). This makes more
sense than having the price drop by $721 or $1276 each year, because in the linear model the price
eventually becomes negative, and because newer cars lose value faster than older cars.
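A short Python check of this back-transformation, using the rounded coefficients above (so the printed values differ slightly in the last digit from the $12,337 and 0.844 figures, which come from unrounded coefficients):

```python
# Back-transformation check using the rounded coefficients above.
import math

b0, b_age, b_miles = 9.42, -0.169, -0.0000018

print(round(math.exp(b0)))             # new-car price: about 12333
print(round(math.exp(b_age), 3))       # yearly multiplier: about 0.845
print(f"{1 - math.exp(b_age):.1%} price drop per year")   # about 15.5%

# Example: predicted price of a 5-year-old car with 40,000 miles
print(round(math.exp(b0 + b_age * 5 + b_miles * 40000)))
```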
11.
a.
The correlation matrix and scatterplot matrix appear as follows.
All the Pearson probabilities of the correlations are extremely close to 0. The statistical
significance is partly due to the large number of observations involved, which makes it possible to
detect smaller correlations.
b.
The output from the Analysis ToolPak’s Regression command appears as follows:
MPG = –14.54 – 0.330(Cylinders) + 0.00768(Engine Disp) – 0.00039(Horsepower)
      – 0.00679(Weight) + 0.0853(Acceleration) + 0.753(Year)
c.
Weight, Horsepower, and Engine Disp are all strongly pairwise related. This means that changes in
one are accompanied by changes in the other two. Once Weight has been added to the regression
model, there is little “extra” that Horsepower and Engine Disp will add to the model, so their effects
will appear insignificant.
d.
The plot Residuals vs. Predicted Values appears as follows:
The residuals seem to form a U-shaped curve, indicating that the model has some serious flaws.
Transforming the data may remove the curve.
e.
The regression statistics for the Log MPG variable appear as follows:

Regression Statistics
Multiple R          0.935
R Square            0.874
Adjusted R Square   0.872
Standard Error      0.053
Observations        392

ANOVA
            df    SS            MS         F          Significance F
Regression    6   7.454955486   1.242493   446.1594   6.381E-170
Residual    385   1.072171984   0.002785
Total       391   8.52712747

             Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept           0.794            0.073   10.845       0.000       0.650       0.938
Cylinders          -0.010            0.005   -1.982       0.048      -0.020       0.000
Engine Disp         0.000            0.000    1.104       0.270       0.000       0.000
Horsepower          0.000            0.000   -2.047       0.041      -0.001       0.000
Weight              0.000            0.000  -11.151       0.000       0.000       0.000
Accelerate         -0.001            0.002   -0.336       0.737      -0.004       0.003
Year                0.013            0.001   15.930       0.000       0.011       0.014

Log(MPG) = 0.794 – 0.0101(Cylinders) + 0.000125(Engine Disp) – 0.00044(Horsepower)
           – 0.00011(Weight) – 0.00053(Acceleration) + 0.0129(Year)
The plot of Residuals vs. Predicted values appears as:
The transformation has improved the scatterplot, though there is still a curvilinear trend
in the plot. The plots of the residuals vs. each of the predictor variables appear as
follows:
The model appears to fail for the Weight, Cylinders, and Engine Displacement variables.
There do not appear to be any problems with Acceleration, Year, or Horsepower.
These problems might be fixed by including powers of the different variables, for
example including Weight^2 or (Engine Displacement)^2, as in the sketch below.
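A minimal sketch of how such squared terms could be added to the design matrix; the arrays named in the commented call (weight, engine_disp, log_mpg) are hypothetical stand-ins for the worksheet columns.

```python
# Minimal sketch: build a design matrix that includes each predictor and
# its square, then fit by least squares.
import numpy as np

def design_with_squares(*cols):
    n = len(cols[0])
    parts = [np.ones(n)]                # intercept column
    for c in cols:
        parts += [c, c ** 2]            # e.g. Weight and Weight^2
    return np.column_stack(parts)

# beta, *_ = np.linalg.lstsq(design_with_squares(weight, engine_disp),
#                            log_mpg, rcond=None)
```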
12.
a.
The following steps are used to remove the variables from the model:

Step   Status                                                      Action
1      The least significant predictor that is nonsignificant      Remove Acceleration
       is Acceleration at 0.737
2      The least significant predictor that is nonsignificant      Remove Engine Disp
       is Engine Disp at 0.251
3      The least significant predictor that is nonsignificant      Remove Cylinders
       is Cylinders at 0.106
4      All remaining predictors are significant                    Stop
The final regression equation is:
log MPG = 0.76 – 0.00038(Horsepower) – 0.00012(Weight) + 0.013(Year).
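The elimination procedure above is easy to automate. Here is a minimal sketch using statsmodels; the DataFrame `cars` and its column names in the commented call are hypothetical stand-ins for the worksheet data.

```python
# Minimal sketch of backward elimination: refit, drop the least
# significant predictor, and repeat until all p-values fall below alpha.
import statsmodels.api as sm

def backward_eliminate(df, response, predictors, alpha=0.05):
    preds = list(predictors)
    while preds:
        X = sm.add_constant(df[preds])
        fit = sm.OLS(df[response], X).fit()
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < alpha:
            return fit                  # all remaining predictors significant
        print(f"Remove {worst} (p = {pvals[worst]:.3f})")
        preds.remove(worst)
    return None

# fit = backward_eliminate(cars, "Log MPG",
#     ["Cylinders", "Engine Disp", "Horsepower", "Weight", "Accelerate", "Year"])
```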
b.
The R2 value of the full model is 0.874. The R2 value of the reduced model is 0.873. By removing 3
variables from the regression model, we have only reduced the R2 value by 0.001.
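These R2 figures can also be compared on an adjusted basis with the standard formula; a quick check with n = 392 observations:

```python
# Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1) for k predictors, n = 392.
def adj_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adj_r2(0.874, 392, 6), 3))   # full model (6 predictors): 0.872
print(round(adj_r2(0.873, 392, 3), 3))   # reduced model (3 predictors): 0.872
```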
13.
a.
The regression statistics are:

Regression Statistics
Multiple R          0.940
R Square            0.884
Adjusted R Square   0.881
Standard Error      0.046
Observations        245

ANOVA
            df    SS      MS      F         Significance F
Regression    6   3.871   0.645   301.019   0.000
Residual    238   0.510   0.002
Total       244   4.382

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept           0.911            0.084   10.780      0.000       0.744       1.077
Cylinders          -0.018            0.005   -3.488      0.001      -0.028      -0.008
Engine Disp         0.000            0.000    0.218      0.828       0.000       0.000
Horsepower         -0.001            0.000   -2.946      0.004      -0.001       0.000
Weight              0.000            0.000   -7.124      0.000       0.000       0.000
Accelerate         -0.008            0.002   -4.346      0.000      -0.011      -0.004
Year                0.012            0.001   13.012      0.000       0.011       0.014
For the American cars only, the regression model that includes all the predictor variables is:
log(MPG) = 0.911 – 0.0179(Cylinders) + 0.0000273(Engine Disp) – 0.00062(Horsepower)
           – 0.000080(Weight) – 0.0079(Acceleration) + 0.012402(Year).
b.
The following residual plots check the regression assumptions:
The residual plots show the model does not fulfill all the assumptions. There are a few outliers
which show up on the normal probability plot. There is some evidence of non-constant variance
in the Residuals vs. Predicted Values plot, as the spread of the residuals is less for the low and the
high predicted values. There is some evidence of curvature in the Year, Weight, and Engine
Displacement plots. However, the multiple R2 value is 0.884, which means that this model does
account for 88.4% of the variation in American Log(MPG). We would by no means be content
with this model in its present form as the final model, but it does make for a good starting point
for discussion and is “reasonable” in that context.
c.
The first few values of the column are:

Model                       Log MPG       Predicted   Residuals
amc ambassador dpl          1.176091259   1.1536       0.0225
amc gremlin                 1.322219295   1.2909       0.0314
amc hornet                  1.255272505   1.2725      -0.0172
amc rebel sst               1.204119983   1.1817       0.0225
buick estate wagon (sw)     1.146128036   1.1828      -0.0366
buick skylark 320           1.176091259   1.1568       0.0193
chevrolet chevelle malibu   1.255272505   1.1885       0.0668
chevrolet impala            1.146128036   1.0925       0.0536
chevrolet monte carlo       1.176091259   1.1778      -0.0018
chevy c20                   1             1.0517      -0.0517
d.
The plot is:
[Figure: Residuals vs. Predicted plot with American, European, and Japanese cars marked
separately; predicted values range from about 1.0 to 1.6 and residuals from about –0.2 to 0.25]
e.
The descriptive statistics for the residuals are:

                         Origin = "American"   Origin = "European"   Origin = "Japanese"
Count                    245                   68                    79
Sum                      0.0000                1.9497                2.2801
Average                  0.000000              0.028672              0.028862
Median                   0.0037                0.0224                0.0346
Minimum                  -0.1638               -0.1087               -0.1768
Maximum                  0.1804                0.2074                0.1852
Range                    0.3442                0.3161                0.3619
Standard Deviation       0.0457251             0.0704523             0.0612171
Variance                 0.0020908             0.0049635             0.0037475
Standard Error           0.0029213             0.0085436             0.0068875
t statistic (mean = 0)   0.000                 3.356                 4.191
t statistic p-value      1.000                 0.001                 0.000
lower 95% c.i.           -0.005754             0.011619              0.015150
upper 95% c.i.           0.005754              0.045725              0.042574
f.
After adjusting for other factors as determined by the regression equation for American cars,
European cars have a higher Log(MPG) by 0.0116 to 0.0457 points (95% confidence interval). The
95% CI for Japanese cars is (0.0152, 0.0426). In terms of MPG, this means an increase of 2.7 to
11.1% for European cars and 3.6 to 10.3% for Japanese cars.
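Since Log(MPG) here is a base-10 log, a shift of c on the log scale multiplies MPG by 10^c. A quick check of the percentage figures:

```python
# The residual means are on a log10(MPG) scale, so a shift of c
# multiplies MPG by 10**c. Checking the percentage ranges quoted above:
for name, (lo, hi) in [("European", (0.0116, 0.0457)),
                       ("Japanese", (0.0152, 0.0426))]:
    print(f"{name}: {10**lo - 1:.1%} to {10**hi - 1:.1%} higher MPG")
# European: 2.7% to 11.1%; Japanese: 3.6% to 10.3%
```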
14.
a.
The scatterplot appears as:
[Figure: map plot of city temperatures plotted by Longitude and Latitude; the plotted
temperatures range from about 0 to 65]
b.
The regression statistics are:

Regression Statistics
Multiple R          0.861
R Square            0.741
Adjusted R Square   0.731
Standard Error      6.935
Observations        56

ANOVA
            df   SS         MS         F        Significance F
Regression    2  7297.335   3648.667   75.875   2.79154E-16
Residual     53  2548.647   48.088
Total        55  9845.982

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept         98.645            8.327   11.846      0.000      81.943     115.347
Long               0.134            0.063    2.122      0.039       0.007       0.261
Lat               -2.164            0.176  -12.314      0.000      -2.516      -1.811

c.
The significance level for the Latitude predictor variable is < 0.001, while the p-value for the
Longitude predictor variable is 0.039. Both factors are significant at the 5% level. The R2 value is
0.741, so 74.1% of the variation in temperature is explained by the regression equation.
d.
The residual values on the map plot appear as follows:
[Figure: map plot of the residuals by Longitude and Latitude; the residuals range from
about –13 to 22]
e.
Cities on the East and West coasts show positive residuals, while cities in the country’s
interior show negative residuals.
15.
a.
The regression statistics are:

Regression Statistics
Multiple R          0.857
R Square            0.735
Adjusted R Square   0.728
Standard Error      19847.671
Observations        117

ANOVA
            df    SS            MS         F         Significance F
Regression    3   1.23375E+11   4.11E+10   104.397   1.95925E-32
Residual    113   44514094371   3.94E+08
Total       116   1.67889E+11

              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept         7530.949         7412.625    1.016     0.312   -7154.793   22216.691
Square Feet         58.437            3.830   15.258     0.000      50.849      66.025
Age               -374.156          163.497   -2.288     0.024    -698.072     -50.239
Features          2257.974         1445.182    1.562     0.121    -605.192    5121.140

Price = 7530.95 + 58.437(Square Feet) – 374.16(Age) + 2257.97(Features)
b.
The residual plot is:
[Figure: Residuals vs. Predicted plot; predicted values range from 50,000 to 250,000 and
residuals from about –120,000 to 100,000]
The residual plot indicates a possible violation of the assumption of constant variance in the
residuals. As the predicted value increases, the spread of the residuals increases as well.
c.
The regression statistics are:

Regression Statistics
Multiple R          0.866
R Square            0.751
Adjusted R Square   0.744
Standard Error      0.070
Observations        117

ANOVA
            df    SS            MS         F          Significance F
Regression    3   1.680060833   0.56002    113.4214   6.10099E-34
Residual    113   0.557939533   0.004938
Total       116   2.238000365

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept            4.641            0.026   176.837     0.000       4.589       4.693
Square Feet          0.000            0.000    15.787     0.000       0.000       0.000
Age                 -0.002            0.001    -2.613     0.010      -0.003       0.000
Features             0.009            0.005     1.765     0.080      -0.001       0.019
The residual plot is:
[Figure: Residuals vs. Predicted plot for the Log(Price) model; predicted values range from 4.7
to 5.5 and residuals from about –0.4 to 0.3, with the $129,500 house appearing as a low outlier]
The transformation appears to take care of the problem of non-constant variance.
d.
The point belongs to a house that is priced at $129,500. However, based on the regression model
using the Log(Price) variable, this house should be priced at $282,426. Thus, based on the model,
the house is very underpriced.
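A quick sanity check of the underpricing claim, assuming prices were modeled on a log10 scale and reading the house's residual (about –0.34) approximately from the plot:

```python
# Sanity check: on a log10 price scale, a residual of c means the actual
# price is 10**c times the predicted price.
actual = 129_500
residual = -0.338              # log10(actual price) - log10(predicted price)
predicted = actual / 10 ** residual
print(round(predicted))        # about 282,000, close to the $282,426 above
```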
16.
a.
The scatterplot and trend line appear as follows:
[Figure: scatterplot of Unemployment vs. FRB index with trend line; the FRB index runs from
100 to 170 and unemployment from 1 to 5]
Though the trend line is positive, there is a lot of variability in the data. It is unclear whether
unemployment rises with the FRB index.
b.
Unemployment = –0.035 + 0.021 (FRB index)
R2 = 0.098. The regression explains only 9.8% of the variability in unemployment.
c.
After adding Years to the regression, the equation is:
Unemployment = 13.454 – 0.103 (FRB index) + 0.659 (Year)
R2 = 0.866, accounting for 86.6% of the variability.
d.
The parameter for the FRB index has changed sign from one regression to the other. Taking these
results at face value, we would come to different conclusions about the relationship between
unemployment and the FRB index depending on which regression we used.
e.
The Pearson correlation is 0.906 with a p-value < 0.001, so there is significant correlation between
the FRB index and the Year variable.
Because Year and the FRB index are highly correlated, they are essentially providing the same
information to the regression equation. The collinearity makes it difficult to interpret the regression
equation when both are present. Thus the regression equation parameters are highly suspect.
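This sign flip is a classic symptom of collinearity and is easy to reproduce. Here is a minimal sketch with synthetic data in which an index tracks the year closely; the single-predictor slope and the two-predictor slope for the index have opposite signs.

```python
# Synthetic demonstration of the sign flip: an index that tracks the
# year closely gets a positive slope on its own but a negative slope
# once Year enters the model.
import numpy as np

rng = np.random.default_rng(1)
year = np.arange(30, dtype=float)
frb = 100 + 2 * year + rng.normal(scale=3, size=30)      # collinear with year
unemp = 1 + 0.65 * year - 0.10 * frb + rng.normal(scale=0.3, size=30)

def slopes(*cols):
    X = np.column_stack([np.ones(len(unemp)), *cols])
    return np.linalg.lstsq(X, unemp, rcond=None)[0]

print(np.round(slopes(frb), 3))        # [intercept, FRB]: FRB slope positive
print(np.round(slopes(frb, year), 3))  # [intercept, FRB, Year]: FRB slope negative
```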